Advances in the Prediction of Protein Subcellular Locations with Machine Learning

2019 ◽  
Vol 14 (5) ◽  
pp. 406-421 ◽  
Author(s):  
Ting-He Zhang ◽  
Shao-Wu Zhang

Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, computational methods that identify protein subcellular locations efficiently and effectively are highly desirable. In particular, the rapidly increasing number of protein sequences entering the genome databases calls for the development of automated analysis methods. Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, and v) web servers. Result & Conclusion: Given the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. The second is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.
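
The cross-validation tests named in aspect iv) can be illustrated with a minimal sketch of k-fold index splitting. The function name `k_fold_indices` is hypothetical and not taken from the review; setting `k` equal to the number of samples gives the leave-one-out (jackknife) test.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds.

    Illustrative helper only; real studies usually shuffle or stratify
    the samples before splitting.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


if __name__ == "__main__":
    for train, test in k_fold_indices(5, 2):
        print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every protein is predicted once by a model that never saw it during training.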


2021 ◽  
Vol 22 (S10) ◽  
Author(s):  
Zhijun Liao ◽  
Gaofeng Pan ◽  
Chao Sun ◽  
Jijun Tang

Background: Protein subcellular localization prediction plays an important role in biology research. Since traditional methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of the proposed methods ignore the evolutionary information of proteins. To improve prediction accuracy, we present a deep learning-based method to predict protein subcellular locations. Results: Our method uses not only the amino acid sequence composition but also the evolution matrices of proteins. It uses a bidirectional long short-term memory network that processes the entire protein sequence and a convolutional neural network that extracts features from protein sequences. The position-specific scoring matrix is used as a supplement to the protein sequences. The method was trained and tested on two benchmark datasets. The experimental results show that it yields accurate results on the two datasets, with an average precision of 0.7901, a ranking loss of 0.0758, and a coverage of 1.2848. Conclusion: The experimental results show that our method outperforms five currently available methods and is an acceptable alternative for predicting protein subcellular location.
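
The three reported metrics (average precision, ranking loss, coverage) are standard multi-label ranking measures. The sketch below shows one common per-protein definition of each; it is an assumption that the paper uses these exact conventions (some definitions of coverage subtract 1, for example), and the function names are illustrative.

```python
def ranking_loss(y_true, scores):
    """Fraction of (true, false) label pairs ranked in the wrong order.

    Ties are counted as errors in this sketch.
    """
    relevant = [s for t, s in zip(y_true, scores) if t]
    irrelevant = [s for t, s in zip(y_true, scores) if not t]
    if not relevant or not irrelevant:
        return 0.0
    wrong = sum(1 for r in relevant for i in irrelevant if i >= r)
    return wrong / (len(relevant) * len(irrelevant))


def coverage(y_true, scores):
    """How many top-ranked labels are needed to include every true label.

    Assumes at least one true label.
    """
    lowest_true = min(s for t, s in zip(y_true, scores) if t)
    return sum(1 for s in scores if s >= lowest_true)


def average_precision(y_true, scores):
    """Mean, over the true labels, of the precision at each label's rank.

    Assumes at least one true label.
    """
    true_scores = [s for t, s in zip(y_true, scores) if t]
    total = 0.0
    for ts in true_scores:
        rank = sum(1 for s in scores if s >= ts)
        true_at_or_above = sum(1 for u in true_scores if u >= ts)
        total += true_at_or_above / rank
    return total / len(true_scores)


if __name__ == "__main__":
    # One protein, four candidate locations; locations 0 and 2 are correct.
    y_true = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.1]
    print(ranking_loss(y_true, scores))
    print(coverage(y_true, scores))
    print(average_precision(y_true, scores))
```

Dataset-level values are obtained by averaging these per-protein values; higher average precision and lower ranking loss and coverage indicate better predictions.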



2021 ◽  
pp. 000370282110345
Author(s):  
Tatu Rojalin ◽  
Dexter Antonio ◽  
Ambarish Kulkarni ◽  
Randy P. Carney

Surface-enhanced Raman scattering (SERS) is a powerful technique for sensitive label-free analysis of chemical and biological samples. While much recent work has established sophisticated automation routines using machine learning and related artificial intelligence methods, these efforts have largely focused on downstream processing (e.g., classification tasks) of previously collected data. While fully automated analysis pipelines are desirable, current progress is limited by cumbersome and manually intensive sample preparation and data collection steps. Specifically, a typical lab-scale SERS experiment requires the user to evaluate the quality and reliability of the measurement (i.e., the spectra) as the data are being collected. This need for expert user intuition is a major bottleneck that limits the applicability of SERS-based diagnostics for point-of-care clinical applications, where trained spectroscopists are likely unavailable. While application-agnostic numerical approaches (e.g., signal-to-noise thresholding) are useful, there is an urgent need to develop algorithms that leverage expert user intuition and domain knowledge to simplify and accelerate data collection steps. To address this challenge, we introduce a machine learning-assisted method at the acquisition stage. We tested six common algorithms to determine which performs best at judging spectral quality. For adoption into future automation platforms, we developed an open-source Python package tailored for rapid expert user annotation to train machine learning algorithms. We expect that this new approach of using machine learning to assist in data acquisition can serve as a useful building block for point-of-care SERS diagnostic platforms.
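
The application-agnostic baseline mentioned above, signal-to-noise thresholding, can be sketched as follows. The SNR definition (peak height above the baseline divided by the standard deviation of a signal-free region), the default threshold, and the function names are assumptions chosen for illustration; they are not taken from the paper or its package.

```python
from statistics import mean, stdev


def signal_to_noise(spectrum, noise_slice):
    """Peak height above baseline divided by the noise standard deviation.

    `noise_slice` selects a region assumed to contain no Raman bands
    (it must hold at least two points).
    """
    noise = spectrum[noise_slice]
    baseline = mean(noise)
    sigma = stdev(noise)
    if sigma == 0:
        return float("inf")
    return (max(spectrum) - baseline) / sigma


def passes_threshold(spectrum, noise_slice, min_snr=10.0):
    """Accept a spectrum only if its SNR reaches the (hypothetical) cutoff."""
    return signal_to_noise(spectrum, noise_slice) >= min_snr


if __name__ == "__main__":
    good = [1.0, 1.2, 0.8, 1.1, 0.9, 25.0, 1.0]
    poor = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 1.0]
    region = slice(0, 5)
    print(passes_threshold(good, region), passes_threshold(poor, region))
```

A rule like this knows nothing about which bands matter for a given sample, which is the gap the expert-annotated machine learning approach is meant to fill.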



2019 ◽  
Vol 36 (6) ◽  
pp. 1908-1914 ◽  
Author(s):  
Ying-Ying Xu ◽  
Hong-Bin Shen ◽  
Robert F Murphy

Motivation: Systematic and comprehensive analysis of protein subcellular location as a critical part of proteomics (‘location proteomics’) has been studied for many years, but annotating protein subcellular locations and understanding variation of the location patterns across various cell types and states is still challenging. Results: In this work, we used immunohistochemistry images from the Human Protein Atlas as the source of subcellular location information, and built classification models for the complex protein spatial distribution in normal and cancerous tissues. The models can automatically estimate the fractions of protein in different subcellular locations, and can help to quantify the changes of protein distribution from normal to cancer tissues. In addition, we examined the extent to which different annotated protein pathways and complexes showed similarity in the locations of their member proteins, and then predicted new potential proteins for these networks. Availability and implementation: The dataset and code are available at: www.csbio.sjtu.edu.cn/bioinf/complexsubcellularpatterns. Supplementary information: Supplementary data are available at Bioinformatics online.



2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Thomas Kurmann ◽  
Siqing Yu ◽  
Pablo Márquez-Neila ◽  
Andreas Ebneter ◽  
Martin Zinkernagel ◽  
...  

In ophthalmology, retinal biological markers, or biomarkers, play a critical role in the management of chronic eye conditions and in the development of new therapeutics. While many imaging technologies used today can visualize these, Optical Coherence Tomography (OCT) is often the tool of choice due to its ability to image retinal structures in three dimensions at micrometer resolution. But with widespread use in clinical routine, and the growing prevalence of chronic retinal conditions, the quantity of scans acquired worldwide is surpassing the capacity of retinal specialists to inspect these in meaningful ways. Instead, automated analysis of scans using machine learning algorithms provides a cost-effective and reliable alternative to assist ophthalmologists in clinical routine and research. We present a machine learning method capable of consistently identifying a wide range of common retinal biomarkers from OCT scans. Our approach avoids the need for costly segmentation annotations and allows scans to be characterized by biomarker distributions. These can then be used to classify scans based on their underlying pathology in a device-independent way.



2008 ◽  
Vol 9 (1) ◽  
Author(s):  
Myron Peto ◽  
Andrzej Kloczkowski ◽  
Vasant Honavar ◽  
Robert L Jernigan


Author(s):  
Walaa Alkady ◽  
Muhammad Zanaty ◽  
Heba M. Afify

Coronavirus infection has evolved into an international epidemic, reported in 27 countries as a serious respiratory disease. Computational analysis of this virus in relation to the human population is therefore urgently needed. In this paper, machine learning algorithms are applied to classify the human protein sequences of COVID-19 according to country. The proposed model distinguishes 9238 sequences in three stages: data preprocessing, data labeling, and classification. In the first stage, preprocessing converts the amino acids of the COVID-19 protein sequences into eight groups of numbers based on the volume and dipole of the amino acids. In the second stage, two methods are used to label the 27 countries from 0 to 26. The first method assigns one number to each country according to its country code number, while the second represents each country using binary elements only. Classification algorithms are then run to distinguish COVID-19 protein sequences by country. An accuracy of 100% was achieved by the country-based binary labeling method with Linear Regression (LR), K-Nearest Neighbor (KNN), or Support Vector Machine (SVM) classifiers. Further, sequences from the USA, which has a large number of data records, were more likely to be classified correctly than those from countries with few records. The imbalance in the COVID-19 protein sequence data is considered a major issue, especially since the USA accounts for 76% of the 9238 sequences. The proposed model is intended to serve as a diagnostic bioinformatics tool for COVID-19 protein sequences from different countries.
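
The two labeling schemes in the second stage can be sketched as follows. It is an assumption that "binary elements only" means one-hot vectors, and the alphabetical code assignment and function names are illustrative rather than the paper's; the country names in the usage example are placeholders.

```python
def integer_labels(countries):
    """Assign one integer code per country (alphabetical order, from 0)."""
    codes = {name: idx for idx, name in enumerate(sorted(set(countries)))}
    return [codes[c] for c in countries], codes


def binary_labels(countries):
    """Represent each country as a one-hot vector of 0/1 elements.

    One-hot encoding is an assumed reading of the binary labeling method.
    """
    labels, codes = integer_labels(countries)
    n = len(codes)
    return [[1 if j == k else 0 for j in range(n)] for k in labels]


if __name__ == "__main__":
    sample = ["USA", "China", "USA", "Italy"]
    print(integer_labels(sample)[0])
    print(binary_labels(sample))
```

Integer codes impose an arbitrary ordering on the countries, whereas binary vectors treat every country as equally distant from every other, which may be one reason the two schemes behave differently in a classifier.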



2016 ◽  
Vol 13 (1) ◽  
pp. 23-33 ◽  
Author(s):  
Julia Rahman ◽  
Nazrul Islam Mondal ◽  
Khaled Ben Islam ◽  
Al Mehedi Hasan

Because protein subcellular localization is important in many branches of life science and in drug discovery, researchers have focused their attention on its prediction. Effective representation of features from protein sequences plays a vital role in protein subcellular localization prediction, especially for machine learning techniques. A single feature representation such as pseudo amino acid composition (PseAAC), physiochemical property model (PPM), or amino acid index distribution (AAID) captures insufficient information from protein sequences. To address this problem, we propose two fused feature representations, AAIDPAAC and PPMPAAC, for use with a Support Vector Machine classifier; they fuse PseAAC with AAID and PPM, respectively. We evaluated both single and fused feature representations on a Gram-negative bacterial dataset. Compared with single feature representations, we obtained at least 3% higher actual accuracy with AAIDPAAC and 2% higher locative accuracy with PPMPAAC.
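
The idea of feature fusion can be illustrated with a minimal sketch that concatenates two sequence descriptors into one vector. Plain amino acid composition and a two-element charge descriptor are used here as simple stand-ins; they are not the PseAAC, PPM, or AAID descriptors of the paper, and it is an assumption that fusion is done by concatenation. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def charge_fractions(seq):
    """Stand-in property descriptor: fractions of K/R and of D/E residues."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    positive = sum(seq.count(a) for a in "KR") / len(seq)
    negative = sum(seq.count(a) for a in "DE") / len(seq)
    return [positive, negative]


def fuse(*vectors):
    """Fuse descriptors by concatenating them into one feature vector."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


if __name__ == "__main__":
    seq = "MKKDE"
    features = fuse(amino_acid_composition(seq), charge_fractions(seq))
    print(len(features))
```

The fused vector is what would be passed to the classifier; descriptors on very different scales are usually normalized before concatenation so that one does not dominate the other.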



2005 ◽  
Vol 14 (9) ◽  
pp. 1351-1359 ◽  
Author(s):  
Ting Zhao ◽  
M. Velliste ◽  
M.V. Boland ◽  
R.F. Murphy


2020 ◽  
Author(s):  
Victor Anton ◽  
Jannes Germishuys ◽  
Matthias Obst

This paper describes a data system to analyse large amounts of subsea movie data for marine ecological research. The system consists of three distinct modules for data management and archiving, citizen science, and machine learning in a high-performance computing environment. It allows scientists to upload underwater footage to a customised citizen science website hosted by Zooniverse, where volunteers from the public classify the footage. Classifications with high agreement among citizen scientists are then used to train machine learning algorithms. An application programming interface allows researchers to test the algorithms and track biological objects in new footage. We tested our system using recordings from remotely operated vehicles (ROVs) in a Marine Protected Area, the Kosterhavet National Park in Sweden. Results indicate a strong decline of cold-water corals in the park over a period of 15 years, showing that our system can effectively extract valuable occurrence and abundance data for key ecological species from underwater footage. We argue that the combination of citizen science tools, machine learning, and high-performance computers is key to successfully analysing large amounts of image data in the future, and suggest that these services should be consolidated and interlinked by national and international research infrastructures. Novel information system to analyse marine underwater footage.
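
The step of keeping only classifications with high agreement among citizen scientists can be sketched as a majority-vote filter. The thresholds, the clip identifiers, and the function name are hypothetical; the abstract does not state how agreement is measured.

```python
from collections import Counter


def consensus_labels(votes_by_clip, min_agreement=0.8, min_votes=3):
    """Keep clips whose most common label reaches the agreement threshold.

    `votes_by_clip` maps a clip id to the list of labels given by
    volunteers. Both thresholds are assumed values for illustration.
    """
    consensus = {}
    for clip_id, votes in votes_by_clip.items():
        if len(votes) < min_votes:
            continue  # too few classifications to trust
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            consensus[clip_id] = label
    return consensus


if __name__ == "__main__":
    votes = {
        "clip_a": ["coral", "coral", "coral", "coral", "fish"],
        "clip_b": ["coral", "fish", "none"],
        "clip_c": ["fish", "fish"],
    }
    print(consensus_labels(votes))
```

Only the clips that survive this filter would be passed on as training labels; clips with low agreement could be sent back for more classifications or expert review.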


