MiPepid: MicroPeptide identification tool using machine learning

Abstract Background Micropeptides are small proteins with a length of ≤ 100 amino acids. Short open reading frames that could produce micropeptides were traditionally ignored due to technical difficulties, as few small peptides had been experimentally confirmed. In the past decade, a growing number of micropeptides have been shown to play significant roles in vital biological activities. Despite the increased amount of data, we still lack bioinformatics tools for specifically identifying micropeptides from DNA sequences. Indeed, most existing tools for classifying coding and noncoding ORFs were built on datasets in which “normal-sized” proteins were considered to be positives and short ORFs were generally considered to be noncoding. Since the functional and biophysical constraints on small peptides are likely to be different from those on “normal” proteins, methods for predicting short translated ORFs must be trained independently from those for longer proteins. Results In this study, we have developed MiPepid, a machine-learning tool specifically for the identification of micropeptides. We trained MiPepid using carefully cleaned data from existing databases and used logistic regression with 4-mer features. With only the sequence information of an ORF, MiPepid is able to predict whether it encodes a micropeptide with 96% accuracy on a blind dataset of high-confidence micropeptides, and to correctly classify newly discovered micropeptides not included in either the training or the blind test data. Compared with state-of-the-art coding potential prediction methods, MiPepid performs exceptionally well, as other methods incorrectly classify most bona fide micropeptides as noncoding. MiPepid is alignment-free and runs sufficiently fast for genome-scale analyses. It is easy to use and is available at https://github.com/MindAI/MiPepid.
Conclusions MiPepid was developed to specifically predict micropeptides, a category of proteins with increasing significance, from DNA sequences. It shows evident advantages over existing coding potential prediction methods on micropeptide identification. It is ready to use and runs fast.
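
The abstract describes logistic regression over 4-mer features computed from the ORF sequence alone. A minimal sketch of that idea follows. It is not MiPepid's actual code: the function names, the frequency normalization, and the placeholder weights are assumptions made for illustration, and a real model would learn its weights from labeled training data.

```python
from itertools import product
from math import exp

K = 4
# All 4**4 = 256 possible DNA 4-mers, in a fixed order that defines the feature vector.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return the relative frequency of every 4-mer in seq (overlapping windows).

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def logistic_score(features, weights, bias):
    """Logistic regression output: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))


# Hypothetical usage: the zero weights are placeholders, not trained values.
features = kmer_frequencies("ATGGCCAAGTAA")
score = logistic_score(features, [0.0] * len(KMERS), 0.0)
```

Because the features depend only on the sequence itself, a classifier of this shape needs no alignment step, which is consistent with the alignment-free design the abstract mentions.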

MiPepid: MicroPeptide identification tool using machine learning

10.21203/rs.2.9710/v2 ◽

2019 ◽

Author(s):

Mengmeng Zhu ◽

Michael Gribskov

Keyword(s):

Machine Learning ◽

DNA Sequences ◽

Biological Activities ◽

Open Reading Frames ◽

Prediction Methods ◽

Sequence Information ◽

Small Peptides ◽

Blind Test ◽

Small Proteins ◽

Coding Potential

Abstract Background Micropeptides are small proteins with a length of ≤ 100 amino acids. Short open reading frames that could produce micropeptides were traditionally ignored due to technical difficulties, as few small peptides had been experimentally demonstrated. In the past decade, a growing number of micropeptides have been shown to play significant roles in vital biological activities. Despite the increased amount of data, we still lack bioinformatics tools for specifically identifying micropeptides from DNA sequences. Indeed, most existing tools for classifying coding and noncoding ORFs were built on datasets in which “normal-sized” proteins were considered to be positives and short ORFs were generally considered to be noncoding. Since the functional and biophysical constraints on small peptides are likely to be different from those on “normal” proteins, methods for predicting short translated ORFs must be trained independently from those for longer proteins. Results In this study, we have developed MiPepid, a machine-learning tool specifically for the identification of micropeptides. We trained MiPepid using carefully cleaned data from existing databases and used logistic regression with 4-mer features. With only the sequence information of an ORF, MiPepid is able to predict whether it encodes a micropeptide with 96% accuracy on a blind dataset of high-confidence micropeptides, and to correctly classify newly discovered micropeptides not included in either the training or the blind test data. Compared with state-of-the-art coding potential prediction methods, MiPepid performs exceptionally well, as other methods incorrectly classify most bona fide micropeptides as noncoding. MiPepid is alignment-free and runs sufficiently fast for genome-scale analyses. It is easy to use and is available at https://github.com/MindAI/MiPepid. Conclusions MiPepid was developed to specifically predict micropeptides, a category of proteins with increasing significance, from DNA sequences. It shows evident advantages over existing coding potential prediction methods on micropeptide identification. It is ready to use and runs fast. Keywords: micropeptide, small ORF, sORF, smORF, coding, noncoding, lncRNA, machine learning

MiPepid: MicroPeptide identification tool using machine learning

10.21203/rs.2.9710/v1 ◽

2019 ◽

Author(s):

Mengmeng Zhu ◽

Michael Gribskov

Keyword(s):

Machine Learning ◽

DNA Sequences ◽

Biological Activities ◽

Prediction Methods ◽

Sequence Information ◽

Blind Test ◽

Small Proteins ◽

Bona Fide ◽

Genome Scale ◽

Coding Potential

Abstract Background Micropeptides are small proteins with a length of ≤ 100 amino acids. They were traditionally ignored as few were discovered due to technical difficulties. In the past decade, a growing number of micropeptides have been shown to play significant roles in vital biological activities. Despite the increased amount of data, we still lack bioinformatics tools specifically for identifying micropeptides from DNA sequences. Indeed, most existing tools for classifying coding and noncoding ORFs were built on datasets in which “normal-sized” proteins are considered to be positives and short ORFs are generally considered to be noncoding. Since the functional and biophysical constraints on small peptides are likely to be different from those on “normal” proteins, methods for predicting short translated ORFs must be trained independently from those for longer proteins. Results In this study, we developed MiPepid, a machine-learning tool specifically for the identification of micropeptides. We trained MiPepid using carefully cleaned data from existing databases and logistic regression with 4-mer features. With only the sequence information of an ORF, MiPepid is able to predict whether it encodes a micropeptide with 96% accuracy on a blind dataset of high-confidence micropeptides, and to correctly classify newly discovered micropeptides not included in either the training or the blind test data. Compared with state-of-the-art coding potential prediction methods, MiPepid performs exceptionally well, as other methods incorrectly classify most bona fide micropeptides as noncoding. MiPepid is alignment-free and runs sufficiently fast for genome-scale analyses. It is easy to use and is available at https://github.com/MindAI/MiPepid. Conclusion MiPepid was developed to specifically predict micropeptides, a category of proteins with increasing significance, from DNA sequences. It shows evident advantages over existing coding potential prediction methods on micropeptide identification. It is ready to use and runs fast.

OCCAM: prediction of small ORFs in bacterial genomes by means of a target-decoy database approach and machine learning techniques

Database ◽

10.1093/database/baaa067 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Fabio R. Cerqueira ◽

Ana Tereza Ribeiro Vasconcelos

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Open Reading Frames ◽

Machine Learning Techniques ◽

Bacterial Genomes ◽

Small Proteins ◽

Learning Techniques ◽

Speed Up ◽

Computational Procedures ◽

Small ORFs

Abstract Small open reading frames (ORFs) have been systematically disregarded by automatic genome annotation. The difficulty of finding patterns in very short sequences is the main reason small ORFs are overlooked by computational procedures. However, advances in experimental methods show that small proteins can play vital roles in cellular activities. Hence, it is urgent to make progress in the development of computational approaches to speed up the identification of potential small ORFs. In this work, our focus is on bacterial genomes. We improve a previous approach to identify small ORFs in bacteria. Our method uses machine learning techniques and decoy subject sequences to filter out spurious ORF alignments. We show that an advanced multivariate analysis can be more effective in terms of sensitivity than applying the simplistic and widely used e-value cutoff. This is particularly important in the case of small ORFs, for which alignments present higher e-values than usual. Experiments with control datasets show that the machine learning algorithms used in our method to curate significant alignments can achieve average sensitivity and specificity of 97.06% and 99.61%, respectively. Therefore, an important step is provided here toward the construction of more accurate computational tools for the identification of small ORFs in bacteria.

When Long Noncoding Becomes Protein Coding

Molecular and Cellular Biology ◽

10.1128/mcb.00528-19 ◽

2020 ◽

Vol 40 (6) ◽

Cited By ~ 14

Author(s):

Corrine Corrina R. Hartford ◽

Ashish Lal

Keyword(s):

Cell Division ◽

Cell Signaling ◽

Transcription Regulation ◽

Noncoding RNAs ◽

Long Noncoding RNAs ◽

Open Reading Frames ◽

Protein Coding ◽

Small Proteins ◽

Coding Potential ◽

Reading Frames

ABSTRACT Recent advancements in genetic and proteomic technologies have revealed that more of the genome encodes proteins than originally thought possible. Specifically, some putative long noncoding RNAs (lncRNAs) have been misannotated as noncoding. Numerous lncRNAs have been found to contain short open reading frames (sORFs) which have been overlooked because of their small size. Many of these sORFs encode small proteins or micropeptides with fundamental biological importance. These micropeptides can aid in diverse processes, including cell division, transcription regulation, and cell signaling. Here we discuss strategies for establishing the coding potential of putative lncRNAs and describe various functions of known micropeptides.

A proteogenomics workflow to uncover the world of small proteins in Staphylococcus aureus

10.1101/2020.05.25.114132 ◽

2020 ◽

Author(s):

Stephan Fuchs ◽

Martin Kucklick ◽

Erik Lehmann ◽

Alexander Beckmann ◽

Maya Wilkens ◽

...

Keyword(s):

Staphylococcus Aureus ◽

Genome Annotation ◽

Genome Mapping ◽

Stationary Growth Phase ◽

Open Reading Frames ◽

Sequence Information ◽

Bacterial Physiology ◽

Small Proteins ◽

Genome Annotations ◽

Species Specific

Abstract Small proteins play diverse and essential roles in bacterial physiology and virulence. Despite their importance, automated genome annotation algorithms still cannot accurately annotate all respective small open reading frames (sORFs), as they usually provide insufficient sequence information for domain and homology searches, tend to be species specific and only a few experimentally validated examples are covered in standard proteomics studies. The accuracy and reliability of genome annotations, particularly for sORFs, can be significantly improved by integrating protein evidence from experimental approaches that enrich for small proteins. Here we present a highly optimized and flexible workflow for bacterial proteogenomics, which covers all steps from (i) creation of protein databases, (ii) database searches, (iii) peptide-to-genome mapping to (iv) result interpretation and whose automated execution is supported by two open source tools (SALT & Pepper). We used the workflow to identify high quality peptide spectrum matches (PSMs) for both annotated and unannotated small proteins (≤ 100 aa; SP100) in Staphylococcus aureus Newman. Proteins isolated from cells at the exponential and stationary growth phase were digested with different endopeptidases (trypsin, Lys-C, AspN), the resulting peptides fractionated by gel-based and gel-free methods and measured with highly sensitive mass spectrometers. PSMs or sORF predictions from sORFfinder were stringently filtered allowing us to detect 185 soluble SP100, 69 of which were missing in the used genome annotation. Most interestingly, almost half of the identified SP100 were basic, suggesting a role in binding to more acidic molecules such as nucleic acids or phospholipids. In addition, phage-related functions were proposed for 30 SP100, based on the localization of their coding sequences in the genome.

Conserved Regulation of Cardiac Calcium Uptake by Peptides Encoded in Small Open Reading Frames

Science ◽

10.1126/science.1238802 ◽

2013 ◽

Vol 341 (6150) ◽

pp. 1116-1120 ◽

Cited By ~ 188

Author(s):

Emile G. Magny ◽

Jose Ignacio Pueyo ◽

Frances M.G. Pearl ◽

Miguel Angel Cespedes ◽

Jeremy E. Niven ◽

...

Keyword(s):

Amino Acids ◽

Muscle Contraction ◽

DNA Sequences ◽

Calcium Transport ◽

Calcium Uptake ◽

Open Reading Frames ◽

Insect Development ◽

Small Peptides ◽

Reading Frames ◽

Small Open Reading Frames

Small open reading frames (smORFs) are short DNA sequences that are able to encode small peptides of less than 100 amino acids. Study of these elements has been neglected despite thousands existing in our genomes. We and others previously showed that peptides as short as 11 amino acids are translated and provide essential functions during insect development. Here, we describe two peptides of less than 30 amino acids regulating calcium transport, and hence influencing regular muscle contraction, in the Drosophila heart. These peptides seem conserved for more than 550 million years in a range of species from flies to humans, in which they have been implicated in cardiac pathologies. Such conservation suggests that the mechanisms for heart regulation are ancient and that smORFs may be a fundamental genome component that should be studied systematically.

A novel artificial intelligence-based approach for identification of deoxynucleotide aptamers

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009247 ◽

2021 ◽

Vol 17 (8) ◽

pp. e1009247

Author(s):

Frances L. Heredia ◽

Abiel Roche-Lima ◽

Elsie I. Parés-Matos

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Support Vector Machines ◽

DNA Binding ◽

DNA Sequences ◽

Support Vector ◽

Sequence Information ◽

DNA Aptamers ◽

Vector Machines ◽

Selection Of

The selection of a DNA aptamer through the Systematic Evolution of Ligands by EXponential enrichment (SELEX) method involves multiple binding steps, in which a target and a library of randomized DNA sequences are mixed for selection of a single, nucleotide-specific molecule. Usually, 10 to 20 steps are required for SELEX to be completed. Throughout this process it is necessary to discriminate between true DNA aptamers and unspecified DNA-binding sequences. Thus, a novel machine learning-based approach was developed to support and simplify the early steps of the SELEX process, to help discriminate true DNA aptamers from unspecified DNA-binding sequences. An Artificial Intelligence (AI) approach to identify aptamers was implemented based on Natural Language Processing (NLP) and Machine Learning (ML). An NLP method (CountVectorizer) was used to extract information from the nucleotide sequences. Four ML algorithms (Logistic Regression, Decision Tree, Gaussian Naïve Bayes, Support Vector Machines) were trained using data from the NLP method along with sequence information. The best performing model was Support Vector Machines because it had the best ability to discriminate between positive and negative classes. In our model, an Accuracy (A) of 0.995, the fraction of samples that the model correctly classified, and an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.998, the degree to which a model is capable of distinguishing between classes, were observed. The developed AI approach is useful to identify potential DNA aptamers and to reduce the number of rounds in a SELEX selection. This new approach could be applied in the design of DNA libraries and result in a more efficient and faster process for DNA aptamers to be chosen during SELEX.

Protein Inter-Residue Contacts Prediction: Methods, Performances and Applications

Current Bioinformatics ◽

10.2174/1574893613666181109130430 ◽

2019 ◽

Vol 14 (3) ◽

pp. 178-189 ◽

Cited By ~ 3

Author(s):

Xiaoyang Jing ◽

Qimin Dong ◽

Ruqian Lu ◽

Qiwen Dong

Keyword(s):

Machine Learning ◽

Protein Structure ◽

Tertiary Structure ◽

Prediction Methods ◽

Learning Methods ◽

Typical Application ◽

Machine Learning Methods ◽

Residue Contacts ◽

Fusion Methods ◽

Correlated Mutations

Background: Protein inter-residue contacts prediction plays an important role in the field of protein structure and function research. As a low-dimensional representation of protein tertiary structure, protein inter-residue contacts could greatly help de novo protein structure prediction methods to reduce the conformational search space. Over the past two decades, various methods have been developed for protein inter-residue contacts prediction. Objective: We provide a comprehensive and systematic review of protein inter-residue contacts prediction methods. Results: Protein inter-residue contacts prediction methods are roughly classified into five categories: correlated mutations methods, machine-learning methods, fusion methods, template-based methods and 3D model-based methods. In this paper, we first describe the common definition of protein inter-residue contacts and show the typical application of protein inter-residue contacts. Then, we present a comprehensive review of the three main categories for protein inter-residue contacts prediction: correlated mutations methods, machine-learning methods and fusion methods. Besides, we analyze the constraints for each category. Furthermore, we compare several representative methods on the CASP11 dataset and discuss the performances of these methods in detail. Conclusion: Correlated mutations methods achieve better performances for long-range contacts, while the machine-learning method performs well for short-range contacts. Fusion methods could take advantage of the machine-learning and correlated mutations methods. Employing a more effective fusion strategy could help to further improve the performances of fusion methods.

IIMLP: integrated information-entropy-based method for LncRNA prediction

BMC Bioinformatics ◽

10.1186/s12859-020-03884-w ◽

2021 ◽

Vol 22 (S3) ◽

Author(s):

Junyi Li ◽

Huinian Li ◽

Xiao Ye ◽

Li Zhang ◽

Qingzhe Xu ◽

...

Keyword(s):

Machine Learning ◽

DNA Sequences ◽

Information Entropy ◽

Area Under The Curve ◽

Prediction Method ◽

Machine Learning Algorithms ◽

Reading Frame ◽

Non Coding RNA ◽

The One ◽

Long Non Coding RNA

Abstract Background The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as more and more evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of bio-med big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs. Results We developed a lncRNA prediction method by integrating information-entropy-based features and machine learning algorithms. We calculate generalized topological entropy and generate 6 novel features for lncRNA sequences. By employing these 6 features and other features such as open reading frame, we apply support vector machine, XGBoost and random forest algorithms to distinguish human lncRNAs. We compare our method with one that uses more k-mer features, and the results show that our method achieves a higher area under the curve, up to 99.7905%. Conclusions We develop an accurate and efficient method which has novel information entropy features to analyze and classify lncRNAs. Our method is also extendable for research on the other functional elements in DNA sequences.
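
To give a feel for what an information-entropy feature of a sequence looks like, the sketch below computes the plain Shannon entropy of a sequence's k-mer distribution. This is a simpler stand-in chosen for illustration, not the generalized topological entropy the abstract refers to, and the function name and the default k are assumptions.

```python
from collections import Counter
from math import log2


def kmer_shannon_entropy(seq, k=3):
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0  # sequence shorter than k: no k-mers to measure
    counts = Counter(kmers)
    n = len(kmers)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# A repetitive sequence has zero entropy; a uniform mix of the four bases has 2 bits at k=1.
low = kmer_shannon_entropy("AAAAAA", k=3)
high = kmer_shannon_entropy("ACGT", k=1)
```

A single scalar like this could be appended to other per-sequence features, such as open reading frame length, before training a classifier.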

Alternative processing of RNA transcribed from NMYC.

Molecular and Cellular Biology ◽

10.1128/mcb.7.12.4266 ◽

1987 ◽

Vol 7 (12) ◽

pp. 4266-4272 ◽

Cited By ~ 12

Author(s):

L W Stanton ◽

J M Bishop

Keyword(s):

Alternative Splicing ◽

Open Reading Frames ◽

Structure And Stability ◽

Alternative Processing ◽

Coding Potential ◽

Reading Frames

NMYC is a gene whose amplification and overexpression have been implicated in the generation of certain human malignancies. Little is known of how the expression of NMYC is normally controlled. We have therefore characterized transcription from the gene and the structure and stability of the resulting mRNAs. Transcription from NMYC is exceptionally complex: it initiates at numerous sites that may be grouped under the control of two promoters, and the multiplicity of initiation sites combines with alternative splicing to engender two forms of mRNA. The mRNAs have different 5' leader sequences (alternative first exons of the gene) but identical bodies (the second and third exons of the gene). Both forms of mRNA are unstable, with half-lives of ca. 15 min. Both encode the previously identified 65,000 and 67,000-dalton products of NMYC. However, the alternative first exons contain distinctive open reading frames that may diversify the coding potential of NMYC. The complexities in transcription of NMYC expand the means by which expression of the gene might be controlled.
