Prioritizing bona fide bacterial small RNAs with machine learning classifiers

Author(s):  
Erik JJ Eppenhof ◽  
Lourdes Peña-Castillo

Bacterial small non-coding RNAs (sRNAs) are involved in the control of several cellular processes. Hundreds of putative sRNAs have been identified in many bacterial species through RNA sequencing. The existence of putative sRNAs is usually validated by Northern blot analysis. However, the large number of novel putative sRNAs reported in the literature makes it impractical to validate each of them in the wet lab. In this work, we applied five machine learning approaches to construct twenty models to discriminate bona fide sRNAs from random genomic sequences in five bacterial species. Sequences were represented using seven features including free energy of their predicted secondary structure, their distances to the closest predicted promoter site and Rho-independent terminator, and their distance to the closest open reading frames (ORFs). To automatically calculate these features, we developed an sRNA Characterization Pipeline (sRNACharP). All seven features used in the classification task contributed positively to the performance of the predictive models. The five best-performing models obtained a median precision of 100% at 10% recall and of 60% at 40% recall across all five bacterial species. Our results suggest that even though there is limited sRNA sequence conservation across different bacterial species, there are intrinsic features of sRNAs that are conserved across taxa. We show that these features are exploited by machine learning approaches to learn a species-independent model to prioritize bona fide bacterial sRNAs.
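The abstract reports precision at fixed recall levels (10% and 40%). As a minimal sketch of how such a figure can be computed from ranked classifier scores, the Python below walks down the ranking until the target recall is reached and reports the precision at that cutoff. The function name `precision_at_recall` and the toy labels and scores are hypothetical illustrations, not the authors' code or data, and tied scores are not treated specially.

```python
def precision_at_recall(labels, scores, target_recall):
    """Precision at the shortest score-ranked prefix whose recall reaches target_recall.

    labels: 1 for a bona fide sRNA, 0 for a random genomic sequence.
    scores: classifier scores, higher meaning more sRNA-like.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    return true_pos / len(ranked)


# Toy data (hypothetical): five positives among ten ranked sequences.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.50, 0.30, 0.20, 0.10]

for recall in (0.1, 0.4, 0.6):
    print(recall, precision_at_recall(labels, scores, recall))
```

Reporting precision at a low recall level matches the prioritization setting described above: only the top-ranked candidates would be taken forward for wet-lab validation, so the quality of the top of the ranking matters most.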

2018 ◽  
Author(s):  
Erik JJ Eppenhof ◽  
Lourdes Peña-Castillo


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6304 ◽  
Author(s):  
Erik J.J. Eppenhof ◽  
Lourdes Peña-Castillo

Bacterial small non-coding RNAs (sRNAs) are involved in the control of several cellular processes. Hundreds of putative sRNAs have been identified in many bacterial species through RNA sequencing. The existence of putative sRNAs is usually validated by Northern blot analysis. However, the large number of novel putative sRNAs reported in the literature makes it impractical to validate each of them in the wet lab. In this work, we applied five machine learning approaches to construct twenty models to discriminate bona fide sRNAs from random genomic sequences in five bacterial species. Sequences were represented using seven features including free energy of their predicted secondary structure, their distances to the closest predicted promoter site and Rho-independent terminator, and their distance to the closest open reading frames (ORFs). To automatically calculate these features, we developed an sRNA Characterization Pipeline (sRNACharP). All seven features used in the classification task contributed positively to the performance of the predictive models. The best-performing model obtained a median precision of 100% at 10% recall and of 64% at 40% recall across all five bacterial species, and it outperformed previously published approaches on two benchmark datasets in terms of precision and recall. Our results indicate that even though there is limited sRNA sequence conservation across different bacterial species, there are intrinsic features in the genomic context of sRNAs that are conserved across taxa. We show that these features are utilized by machine learning approaches to learn a species-independent model to prioritize bona fide bacterial sRNAs.


2020 ◽  
Vol 6 (4) ◽  
pp. 41
Author(s):  
Mihnea P. Dragomir ◽  
Ganiraju C. Manyam ◽  
Leonie Florence Ott ◽  
Léa Berland ◽  
Erik Knutsen ◽  
...  

Non-coding RNAs (ncRNAs) are essential players in many cellular processes, from normal development to oncogenic transformation. Initially, ncRNAs were defined as transcripts that lacked an open reading frame (ORF). However, multiple lines of evidence suggest that certain ncRNAs encode small peptides of less than 100 amino acids. The sequences encoding these peptides are known as small open reading frames (smORFs), many initiating with the traditional AUG start codon but terminating with atypical stop codons, suggesting a different biogenesis. The ncRNA-encoded peptides (ncPEPs) are gradually becoming appreciated as a new class of functional molecules that contribute to diverse cellular processes and are deregulated in different diseases, contributing to pathogenesis. As multiple publications have identified unique ncPEPs, we saw the need to assemble a new web resource that gathers information about these functional ncPEPs. We developed FuncPEP, a new database of functional ncRNA-encoded peptides, containing all experimentally validated and functionally characterized ncPEPs. Currently, FuncPEP includes a comprehensive annotation of 112 functional ncPEPs and specific details regarding the ncRNA transcripts that encode these peptides. We believe that FuncPEP will serve as a platform for further deciphering the biologic significance and medical use of ncPEPs. The link to the FuncPEP database can be found at the end of the Introduction Section.


2019 ◽  
Vol 32 (9) ◽  
pp. 1067-1076 ◽  
Author(s):  
Javier F. Tabima ◽  
Niklaus J. Grünwald

Effectors are small, secreted proteins that facilitate infection of host plants by all major groups of plant pathogens. Effector protein identification in oomycetes relies on identification of open reading frames with certain amino acid motifs, among additional minor criteria. To date, identification of effectors has relied on custom scripts to identify motifs in candidate open reading frames. Here, we developed the R package effectR, which provides a convenient tool for rapid prediction of effectors in oomycete genomes, or with custom scripts for any genome, in a reproducible way. The effectR package relies on a combination of regular expression statements and hidden Markov model approaches to predict candidate RxLR and crinkler effectors. Other custom motifs for novel effectors can easily be implemented and added to package updates. The effectR package has been validated with published oomycete genomes. This package provides a convenient tool for wet lab researchers interested in reproducible identification of candidate effectors in oomycete genomes.
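To illustrate the regular-expression step mentioned above, here is a minimal Python sketch that scans protein sequences for a simplified R-x-L-R motif. It is not the effectR API and does not reproduce the package's actual patterns: the motif `R.LR`, the position window, the function name `find_rxlr_candidates`, and the toy sequences are all assumptions made for illustration, and the hidden Markov model step is not shown.

```python
import re

# Hypothetical simplified motif: arginine, any residue, leucine, arginine.
# The lookahead makes matches zero-width, so overlapping motifs are all found.
RXLR_PATTERN = re.compile(r"(?=R.LR)")


def find_rxlr_candidates(proteins, window=(30, 60)):
    """Map sequence name to the 1-based start of the first motif inside window.

    proteins: dict of name -> amino acid sequence.
    window: assumed (min, max) allowed motif start positions; illustrative only.
    """
    lowest, highest = window
    hits = {}
    for name, sequence in proteins.items():
        for match in RXLR_PATTERN.finditer(sequence.upper()):
            position = match.start() + 1  # convert to 1-based coordinates
            if lowest <= position <= highest:
                hits[name] = position
                break
    return hits


# Toy sequences (hypothetical), not real oomycete proteins.
proteins = {
    "candidate_a": "M" + "A" * 34 + "RSLR" + "A" * 20,  # motif starts at residue 36
    "candidate_b": "MRSLR" + "A" * 50,                  # motif too close to the N-terminus
    "candidate_c": "M" + "A" * 60,                      # no motif
}

print(find_rxlr_candidates(proteins))
```

A motif scan like this only nominates candidates; in the workflow described in the abstract, regular expressions are combined with hidden Markov model searches.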


2001 ◽  
Vol 183 (2) ◽  
pp. 443-450 ◽  
Author(s):  
Jolanta Vitkute ◽  
Kornelijus Stankevicius ◽  
Giedre Tamulaitiene ◽  
Zita Maneliene ◽  
Albertas Timinskas ◽  
...  

ABSTRACT Methyltransferases (MTases) of procaryotes affect general cellular processes such as mismatch repair, regulation of transcription, replication, and transposition, and in some cases may be essential for viability. As components of restriction-modification systems, they contribute to bacterial genetic diversity. The genome of Helicobacter pylori strain 26695 contains 25 open reading frames encoding putative DNA MTases. To assess which MTase genes are active, strain 26695 genomic DNA was tested for cleavage by 147 restriction endonucleases; 24 were found that did not cleave this DNA. The specificities of 11 expressed MTases and the genes encoding them were identified from this restriction data, combined with the known sensitivities of restriction endonucleases to specific DNA modification, homology searches, gene cloning and genomic mapping of the methylated bases m4C, m5C, and m6A.


2007 ◽  
Vol 6 (11) ◽  
pp. 2102-2111 ◽  
Author(s):  
Javier Botet ◽  
Laura Mateos ◽  
José L. Revuelta ◽  
María A. Santos

ABSTRACT Large-scale phenotypic analyses have proved to be useful strategies in providing functional clues about the uncharacterized yeast genes. We used here a chemogenomic profiling of yeast deletion collections to identify the core of cellular processes challenged by treatment with the p-aminobenzoate/folate antimetabolite sulfanilamide. In addition to sulfanilamide-hypersensitive mutants whose deleted genes can be categorized into a number of groups, including one-carbon related metabolism, vacuole biogenesis and vesicular transport, DNA metabolic and cell cycle processes, and lipid and amino acid metabolism, two uncharacterized open reading frames (YHI9 and YMR289w) were also identified. A detailed characterization of YMR289w revealed that this gene was required for growth in media lacking p-aminobenzoic or folic acid and encoded a 4-amino-4-deoxychorismate lyase, which is the last of the three enzymatic activities required for p-aminobenzoic acid biosynthesis. In light of these results, YMR289w was designated ABZ2, in accordance with the accepted nomenclature. ABZ2 was able to rescue the p-aminobenzoate auxotrophy of an Escherichia coli pabC mutant, thus demonstrating that ABZ2 and pabC are functional homologues. Phylogenetic analyses revealed that Abz2p is the founder member of a new group of fungal 4-amino-4-deoxychorismate lyases that have no significant homology to its bacterial or plant counterparts. Abz2p appeared to form homodimers and dimerization was indispensable for its catalytic activity.


PLoS ONE ◽  
2016 ◽  
Vol 11 (10) ◽  
pp. e0165429 ◽  
Author(s):  
Julia Hahn ◽  
Olga V. Tsoy ◽  
Sebastian Thalmann ◽  
Jelena Čuklina ◽  
Mikhail S. Gelfand ◽  
...  

2020 ◽  
Author(s):  
Sebastien A. Choteau ◽  
Audrey Wagner ◽  
Philippe Pierre ◽  
Lionel Spinelli ◽  
Christine Brun

ABSTRACT The development of high-throughput technologies revealed the existence of non-canonical short open reading frames (sORFs) on most eukaryotic RNAs. They are ubiquitous genetic elements highly conserved across species and suspected to be involved in numerous cellular processes. MetamORF (http://metamorf.hb.univ-amu.fr/) aims to provide a repository of unique sORFs identified in the human and mouse genomes with both experimental and computational approaches. By gathering publicly available sORF data, normalizing it and summarizing redundant information, we were able to identify a total of 1,162,675 unique sORFs. Despite the usual characterization of ORFs as short, upstream or downstream, there is currently no clear consensus regarding the definition of these categories. Thus, the data has been reprocessed using a normalized nomenclature. MetamORF enables new analyses at the locus, gene, transcript and ORF levels, which should make it possible to address new questions regarding sORF functions in the future. The repository is available through a user-friendly web interface, allowing easy browsing, visualization, filtering over multiple criteria and export possibilities. sORFs can be searched starting from a gene, a transcript, an ORF ID, or a genome area. The database content has also been made available through track hubs at the UCSC Genome Browser.


2013 ◽  
Vol 11 (05) ◽  
pp. 1342002 ◽  
Author(s):  
ASHIS KUMER BISWAS ◽  
BAOJU ZHANG ◽  
XIAOYONG WU ◽  
JEAN X. GAO

Statistics on open reading frames, base composition, and the properties of predicted secondary structures have the potential to address the problem of discriminating coding from noncoding transcripts. In addition, the next-generation sequencing platform RNA-seq provides a bounty of data from which transcript expression profiles can be extracted, which prompted us to add a new set of dimensions to this classification task. In this paper, we propose CNCTDiscriminator, a coding and noncoding transcript discriminating system that integrates these four categories of transcript features. The feature integration was done using both hypothesis learning and feature-specific ensemble learning approaches. When applied to an independent benchmark dataset, the CNCTDiscriminator model trained with composition and ORF features (precision 83.86%, recall 82.01%) outperforms three other popular methods: CPC (precision 98.31%, recall 25.95%), CPAT (precision 97.74%, recall 52.50%) and PORTRAIT (precision 84.37%, recall 73.2%). However, the CNCTDiscriminator model trained using the ensemble approach shows comparable performance (precision 89.85%, recall 71.08%).
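Because CPC and CPAT report higher precision but much lower recall than CNCTDiscriminator, the comparison is easier to read with a single combined score. The sketch below computes the F1 score (the harmonic mean of precision and recall) from the figures quoted in the abstract. F1 is not reported in the abstract itself; it is used here only as one assumed way of summarizing the trade-off, and the method labels are descriptive names chosen for this example.

```python
# Precision and recall (in percent) as quoted in the abstract above.
reported = {
    "CNCTDiscriminator (composition + ORF)": (83.86, 82.01),
    "CNCTDiscriminator (ensemble)": (89.85, 71.08),
    "CPC": (98.31, 25.95),
    "CPAT": (97.74, 52.50),
    "PORTRAIT": (84.37, 73.2),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rank the methods from the highest to the lowest F1 score.
ranking = sorted(reported, key=lambda name: f1(*reported[name]), reverse=True)

for name in ranking:
    print(f"{name}: F1 = {f1(*reported[name]):.2f}")
```

Under this summary, a method with very high precision but low recall, such as CPC, ranks below methods that balance the two.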

