effectR: An Expandable R Package to Predict Candidate RxLR and CRN Effectors in Oomycetes Using Motif Searches

2019 ◽  
Vol 32 (9) ◽  
pp. 1067-1076 ◽  
Author(s):  
Javier F. Tabima ◽  
Niklaus J. Grünwald

Effectors are small, secreted proteins that facilitate infection of host plants by all major groups of plant pathogens. Effector protein identification in oomycetes relies on identification of open reading frames with certain amino acid motifs among additional minor criteria. To date, identification of effectors relies on custom scripts to identify motifs in candidate open reading frames. Here, we developed the R package effectR, which provides a convenient tool for rapid prediction of effectors in oomycete genomes, or with custom scripts for any genome, in a reproducible way. The effectR package relies on a combination of regular expression statements and hidden Markov model approaches to predict candidate RxLR and crinkler (CRN) effectors. Other custom motifs for novel effectors can easily be implemented and added to package updates. The effectR package has been validated with published oomycete genomes. This package provides a convenient tool for wet lab researchers interested in reproducible identification of candidate effectors in oomycete genomes.
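The motif-search step described in this abstract can be illustrated with a minimal Python sketch. It is not the effectR implementation: the pattern `R.LR`, the 100-residue N-terminal window, and the function name `find_rxlr_candidates` are assumptions chosen for illustration, and the hidden Markov model step is not shown.

```python
import re

# Hypothetical pattern: Arg, any residue, Leu, Arg. This is an illustrative
# assumption, not the regular expression shipped with effectR.
RXLR = re.compile(r"R.LR")


def find_rxlr_candidates(sequences, window=100):
    """Return {name: 1-based motif start} for protein sequences whose first
    RxLR motif lies entirely within the first `window` residues."""
    hits = {}
    for name, seq in sequences.items():
        # endpos=window restricts the search to the N-terminal region.
        match = RXLR.search(seq.upper(), 0, window)
        if match:
            hits[name] = match.start() + 1
    return hits


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    proteins = {
        "candidate": "MKLSAARQLRTTEER",
        "no_motif": "MKLSAAAAAA",
        "too_far": "M" * 120 + "RSLR",
    }
    print(find_rxlr_candidates(proteins))
```

Restricting the search window reflects the idea that the motif is expected near the N-terminus; the exact window and any additional criteria would need to follow the package documentation.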

2018 ◽  
Author(s):  
Javier F. Tabima ◽  
Niklaus J. Grünwald

ABSTRACT Effectors are by one definition small, secreted proteins that facilitate infection of host plants by all major groups of plant pathogens. Effector protein identification in oomycetes relies on identification of open reading frames with certain amino acid motifs among additional minor criteria. To date, identification of effectors relies on custom scripts to identify motifs in candidate open reading frames. Here, we developed the R package effectR that provides a convenient tool for rapid prediction of effectors in oomycete genomes, or with custom scripts for any genome, in a reproducible way. The effectR package relies on a combination of regular expression statements and hidden Markov model approaches to predict candidate RxLR and CRN effectors. Other custom motifs for novel effectors can easily be implemented and added to package updates. The effectR package has been validated with published oomycete genomes. This package provides a convenient tool for reproducible identification of candidate effectors in oomycete genomes.


2021 ◽  
Author(s):  
Vasily V. Grinev ◽  
Mikalai M. Yatskou ◽  
Victor V. Skakun ◽  
Maryna K. Chepeleva ◽  
Petr V. Nazarov

Abstract Motivation: Modern methods of whole transcriptome sequencing accurately recover nucleotide sequences of RNA molecules present in cells and allow for determining their quantitative abundances. The coding potential of such molecules can be estimated using open reading frame (ORF) finding algorithms, implemented in a number of software packages. However, these algorithms show somewhat limited accuracy, are intended for single-molecule analysis and do not allow selecting proper ORFs in the case of long mRNAs containing multiple ORF candidates. Results: We developed a computational approach, a corresponding machine learning model and a package dedicated to automatic identification of the ORFs in large sets of human mRNA molecules. It is based on vectorization of nucleotide sequences into features, followed by classification using a random forest. The predictive model was validated on sets of human mRNA molecules from the NCBI RefSeq and Ensembl databases and demonstrated almost 95% accuracy in detecting true ORFs. The developed methods and pre-trained classification model were implemented in ORFhunteR, a computational tool that performs automatic identification of true ORFs among large sets of human mRNA molecules. Availability and implementation: The developed open-source R package ORFhunteR is available for the community at a GitHub repository (https://github.com/rfctbio-bsu/ORFhunteR), from Bioconductor (https://bioconductor.org/packages/devel/bioc/html/ORFhunteR.html) and as a web application (http://orfhunter.bsu.by).
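The problem of multiple ORF candidates in one mRNA can be made concrete with a small Python sketch that enumerates candidates on the forward strand. This covers only the candidate-listing step; it does not reproduce ORFhunteR's feature vectorization or random forest classification, and the function name `find_orfs` and the `min_codons` parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Enumerate candidate ORFs (ATG through the next in-frame stop codon)
    in the three forward reading frames.

    Returns a sorted list of (start, end) pairs, 0-based and half-open,
    where `end` includes the stop codon. `min_codons` counts the start
    and stop codons as well.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # keep the first ATG, giving the longest candidate
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    # Toy sequence, invented for the example; it holds two candidates.
    print(find_orfs("CCATGAAATAGGATGCCCTGA"))
```

A single transcript can yield several such candidates, which is why a downstream classifier is needed to pick the true ORF.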


2006 ◽  
Vol 19 (1) ◽  
pp. 69-79 ◽  
Author(s):  
Dean W. Gabriel ◽  
Caitilyn Allen ◽  
Mark Schell ◽  
Timothy P. Denny ◽  
Jean T. Greenberg ◽  
...  

An 8× draft genome was obtained and annotated for Ralstonia solanacearum race 3 biovar 2 (R3B2) strain UW551, a United States Department of Agriculture Select Agent isolated from geranium. The draft UW551 genome consisted of 80,169 reads resulting in 582 contigs containing 5,925,491 base pairs, with an average 64.5% GC content. Annotation revealed a predicted 4,454 protein coding open reading frames (ORFs), 43 tRNAs, and 5 rRNAs; 2,793 (or 62%) of the ORFs had a functional assignment. The UW551 genome was compared with the published genome of R. solanacearum race 1 biovar 3 tropical tomato strain GMI1000. The two phylogenetically distinct strains were at least 71% syntenic in gene organization. Most genes encoding known pathogenicity determinants, including predicted type III secreted effectors, appeared to be common to both strains. A total of 402 unique UW551 ORFs were identified, none of which had a best hit or >45% amino acid sequence identity with any R. solanacearum predicted protein; 16 had strong (E < 10⁻¹³) best hits to ORFs found in other bacterial plant pathogens. Many of the 402 unique genes were clustered, including 5 found in the hrp region and 38 contiguous, potential prophage genes. Conservation of some UW551 unique genes among R3B2 strains was examined by polymerase chain reaction among a group of 58 strains from different races and biovars, resulting in the identification of genes that may be potentially useful for diagnostic detection and identification of R3B2 strains. One 22-kb region that appears to be present in GMI1000 as a result of horizontal gene transfer is absent from UW551 and encodes enzymes that likely are essential for utilization of the three sugar alcohols that distinguish biovars 3 and 4 from biovars 1 and 2.


1999 ◽  
Vol 181 (10) ◽  
pp. 3155-3163 ◽  
Author(s):  
M. Gita Bangera ◽  
Linda S. Thomashow

The polyketide metabolite 2,4-diacetylphloroglucinol (2,4-DAPG) is produced by many strains of fluorescent Pseudomonas spp. with biocontrol activity against soilborne fungal plant pathogens. Genes required for 2,4-DAPG synthesis by P. fluorescens Q2-87 are encoded by a 6.5-kb fragment of genomic DNA that can transfer production of 2,4-DAPG to 2,4-DAPG-nonproducing recipient Pseudomonas strains. In this study the nucleotide sequence was determined for the 6.5-kb fragment and flanking regions of genomic DNA from strain Q2-87. Six open reading frames were identified, four of which (phlACBD) comprise an operon that includes a set of three genes (phlACB) conserved between eubacteria and archaebacteria and a gene (phlD) encoding a polyketide synthase with homology to chalcone and stilbene synthases from plants. The biosynthetic operon is flanked on either side by phlE and phlF, which code respectively for putative efflux and regulatory (repressor) proteins. Expression in Escherichia coli of phlA, phlC, phlB, and phlD, individually or in combination, identified a novel polyketide biosynthetic pathway in which PhlD is responsible for the production of monoacetylphloroglucinol (MAPG). PhlA, PhlC, and PhlB are necessary to convert MAPG to 2,4-DAPG, and they also may function in the synthesis of MAPG.


2003 ◽  
Vol 185 (22) ◽  
pp. 6513-6521 ◽  
Author(s):  
Sharon Melamed ◽  
Edna Tanne ◽  
Raz Ben-Haim ◽  
Orit Edelbaum ◽  
David Yogev ◽  
...  

ABSTRACT Phytoplasmas are unculturable, insect-transmissible plant pathogens belonging to the class Mollicutes. To be transmitted, the phytoplasmas replicate in the insect body and are delivered to the insect's salivary glands, from where they are injected into the recipient plant. Because phytoplasmas cannot be cultured, any attempt to recover phytoplasmal DNA from infected plants or insects has resulted in preparations with a large background of host DNA. Thus, studies of the phytoplasmal genome have been greatly hampered, and aside from the rRNA genes, only a few genes have hitherto been isolated and characterized. We developed a unique method to obtain host-free phytoplasmal genomic DNA from the insect vector's saliva, and we demonstrated the feasibility of this method by isolating and characterizing 78 new putative phytoplasmal open reading frames and their deduced proteins. Based on the newly accumulated information on phytoplasmal genes, preliminary characteristics of the phytoplasmal genome are discussed.


Author(s):  
Yating Liu ◽  
Joseph D Dougherty

Abstract Summary: Whole genome sequencing of patient populations is identifying thousands of new variants in untranslated regions (UTRs). While the consequences of UTR mutations are not as easily predicted from primary sequence as coding mutations are, there are some known features of UTRs that modulate their function. utr.annotation is an R package that can be used to annotate potentially deleterious variants in the UTRs of both human and mouse. Given a CSV or VCF format variant file, utr.annotation reports, for each variant, whether and how it alters known translational regulators, including upstream open reading frames (uORFs), upstream Kozak sequences, polyA signals, Kozak sequences at the annotated translation start site, start codons, and stop codons; the conservation scores at the variant position; and whether and how it changes ribosome loading based on a model derived from empirical data. Availability: utr.annotation is freely available on Bitbucket (https://bitbucket.org/jdlabteam/utr.annotation/src/master/) and CRAN (https://cran.r-project.org/web/packages/utr.annotation/index.html). Supplementary information: Supplementary data are available at https://wustl.box.com/s/yye99bryfin89nav45gv91l5k35fxo7z.


2018 ◽  
Author(s):  
Erik JJ Eppenhof ◽  
Lourdes Peña-Castillo

Bacterial small non-coding RNAs (sRNAs) are involved in the control of several cellular processes. Hundreds of putative sRNAs have been identified in many bacterial species through RNA sequencing. The existence of putative sRNAs is usually validated by Northern blot analysis. However, the large number of novel putative sRNAs reported in the literature makes it impractical to validate each of them in the wet lab. In this work, we applied five machine learning approaches to construct twenty models to discriminate bona fide sRNAs from random genomic sequences in five bacterial species. Sequences were represented using seven features including free energy of their predicted secondary structure, their distances to the closest predicted promoter site and Rho-independent terminator, and their distance to the closest open reading frames (ORFs). To automatically calculate these features, we developed an sRNA Characterization Pipeline (sRNACharP). All seven features used in the classification task contributed positively to the performance of the predictive models. The five best performing models obtained a median precision of 100% at 10% recall and of 60% at 40% recall across all five bacterial species. Our results suggest that even though there is limited sRNA sequence conservation across different bacterial species, there are intrinsic features of sRNAs that are conserved across taxa. We show that these features are exploited by machine learning approaches to learn a species-independent model to prioritize bona fide bacterial sRNAs.
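The "precision at a given recall" figures quoted in this abstract can be read as follows: rank sequences by model score, walk down the ranking until the target fraction of true sRNAs has been recovered, and report the fraction of true sRNAs among the sequences seen so far. The Python sketch below shows that calculation under stated assumptions: the function name `precision_at_recall` is hypothetical, the labels and scores are invented, tied scores are not treated specially, and this is not the authors' evaluation code.

```python
def precision_at_recall(labels, scores, target_recall):
    """Precision at the first rank cutoff where recall reaches target_recall.

    labels: 1 for a true sRNA, 0 for a random genomic sequence.
    scores: model scores, higher meaning more sRNA-like.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")

    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k  # precision among the top k predictions
    raise AssertionError("unreachable: recall reaches 1.0 at the last rank")


if __name__ == "__main__":
    # Invented toy data: three true sRNAs among five sequences.
    labels = [1, 1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    print(precision_at_recall(labels, scores, 0.3))
    print(precision_at_recall(labels, scores, 1.0))
```

High precision at low recall means the top-ranked predictions are reliable, which is the property that matters when only a few candidates can be validated experimentally.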



