Accurate estimation of isoelectric point of protein and peptide based on amino acid sequences

2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system (for example protein, DNA or RNA), the isoelectric point, commonly referred to as the pI, can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge, and thus the electrophoretic mobility, of the ampholyte sums to zero. Several modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analyses. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance depends strongly on the quality of that data. In contrast to iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction. Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
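
The iterative approach benchmarked above reduces to a simple root-finding problem: evaluate the Henderson-Hasselbalch net charge of the sequence at a trial pH and bisect until the charge crosses zero. The sketch below illustrates this in Python with approximate, EMBOSS-like pKa values; the exact values differ between basis sets, which is precisely the sensitivity the paper measures, and this is not the pIR implementation.

```python
# Minimal sketch of the classic iterative (bisection) pI calculation the paper
# benchmarks against learning-based methods. The pKa values are approximate
# (EMBOSS-like) and purely illustrative: the choice of pKa "basis set" is
# exactly what the benchmark shows predictions are sensitive to.

PKA_POS = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}            # protonatable groups
PKA_NEG = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # deprotonatable groups

def net_charge(seq: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch form)."""
    counts = {aa: seq.count(aa) for aa in "KRHDECY"}
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))            # N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))           # C-terminus
    for aa in "KRH":
        charge += counts[aa] / (1.0 + 10 ** (ph - PKA_POS[aa]))
    for aa in "DECY":
        charge -= counts[aa] / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisection search for the pH at which the net charge crosses zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid            # still positively charged, so the pI is higher
        else:
            hi = mid
    return round((lo + hi) / 2.0, 2)

print(isoelectric_point("PEPTIDE"))   # acidic peptide, pI well below 7
```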

Cancers ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 3817
Author(s):  
Shi-Jer Lou ◽  
Ming-Feng Hou ◽  
Hong-Tai Chang ◽  
Chong-Chi Chiu ◽  
Hao-Hsien Lee ◽  
...  

No studies have discussed machine learning algorithms to predict recurrence within 10 years after breast cancer surgery. This study aimed to compare the accuracy of forecasting models for predicting recurrence within 10 years after breast cancer surgery and to identify significant predictors of recurrence. Registry data for breast cancer surgery patients were allocated to a training dataset (n = 798) for model development, a testing dataset (n = 171) for internal validation, and a validating dataset (n = 171) for external validation. Global sensitivity analysis was then performed to evaluate the significance of the selected predictors. Demographic characteristics, clinical characteristics, quality of care, and preoperative quality of life were significantly associated with recurrence within 10 years after breast cancer surgery (p < 0.05). Artificial neural networks had the highest prediction performance indices. Additionally, surgeon volume was the best predictor of recurrence within 10 years after breast cancer surgery, followed by hospital volume and tumor stage. Accurate prediction of recurrence within 10 years by machine learning algorithms may improve precision in managing patients after breast cancer surgery and improve understanding of the risk factors for such recurrence.
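
As a rough illustration of the modeling setup described above (not the study's code or registry data), the sketch below trains an MLP classifier on synthetic stand-in features with the same 798/171/171 split sizes and uses permutation importance as a simple proxy for the global sensitivity analysis.

```python
# Hedged sketch: MLP classifier with train / internal-test / external-validation
# splits matching the sizes reported above, on synthetic stand-in features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 798 + 171 + 171, 8                    # 798 train, 171 test, 171 validate
X = rng.normal(size=(n, p))                  # stand-ins for surgeon volume, hospital volume, stage, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n) > 0).astype(int)

X_tr, y_tr = X[:798], y[:798]
X_te, y_te = X[798:969], y[798:969]
X_va, y_va = X[969:], y[969:]

ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_tr, y_tr)
print("internal AUC:", roc_auc_score(y_te, ann.predict_proba(X_te)[:, 1]))
print("external AUC:", roc_auc_score(y_va, ann.predict_proba(X_va)[:, 1]))

# Permutation importance as a simple stand-in for global sensitivity analysis
imp = permutation_importance(ann, X_va, y_va, n_repeats=20, random_state=0)
print("predictor ranking (most to least important):", np.argsort(imp.importances_mean)[::-1])
```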


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Nalindren Naicker ◽  
Timothy Adeliyi ◽  
Jeanette Wing

Educational Data Mining (EDM) is a rich research field in computer science. Tools and techniques in EDM are useful for predicting student performance, which gives practitioners useful insights to develop appropriate intervention strategies to improve pass rates and increase retention. The performance of state-of-the-art machine learning classifiers is very much dependent on the task at hand. Support vector machines have been used extensively in classification problems; however, the extant literature shows a gap in the application of linear support vector machines as a predictor of student performance. The aim of this study was to compare the performance of linear support vector machines with that of state-of-the-art classical machine learning algorithms in order to determine the algorithm that would improve prediction of student performance. In this quantitative study, an experimental research design was used. Experiments were set up using feature selection on a publicly available dataset of 1000 alpha-numeric student records. Linear support vector machines, benchmarked against ten categorical machine learning algorithms, showed superior performance in predicting student performance. The results of this research showed that features such as race, gender, and lunch influence performance in mathematics, whilst access to lunch was the primary factor influencing reading and writing performance.
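
A minimal sketch of this kind of benchmark follows, assuming a students-performance style CSV with the column names shown; the file name, columns and pass threshold are assumptions for illustration, not details from the study.

```python
# Illustrative benchmark of a linear SVM against two classical classifiers on a
# students-performance style dataset (file name, columns and threshold assumed).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("StudentsPerformance.csv")          # hypothetical 1000-record file
features = ["gender", "race/ethnicity", "lunch",
            "parental level of education", "test preparation course"]
X = df[features]
y = (df["math score"] >= 50).astype(int)             # assumed pass threshold

encode = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), features)])

models = {
    "linear SVM": LinearSVC(C=1.0, max_iter=5000),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in models.items():
    score = cross_val_score(make_pipeline(encode, clf), X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {score:.3f}")
```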


2020 ◽  
Vol 1500 ◽  
pp. 012131
Author(s):  
Firdaus ◽  
Andre Herviant Juliano ◽  
Naufal Rachmatullah ◽  
Sarifah Putri Rafflesia ◽  
Dinna Yunika Hardiyanti ◽  
...  

Author(s):  
Shuangxia Ren ◽  
Jill Zupetic ◽  
Mehdi Nouraie ◽  
Xinghua Lu ◽  
Richard D. Boyce ◽  
...  

Abstract Background: The partial pressure of oxygen (PaO2)/fraction of oxygen delivered (FIO2) ratio is the reference standard for assessment of hypoxemia in mechanically ventilated patients. Non-invasive monitoring with the peripheral saturation of oxygen (SpO2) is increasingly utilized to estimate PaO2 because it does not require invasive sampling. Several equations have been reported to impute PaO2/FIO2 from SpO2/FIO2. However, machine-learning algorithms to impute PaO2 from SpO2 have not been compared to published equations. Research Question: How do machine learning algorithms perform at predicting PaO2 from SpO2 compared to previously published equations? Methods: Three machine learning algorithms (neural network, regression, and kernel-based methods) were developed using 7 clinical variable features (n = 9,900 ICU events) and subsequently 3 features (n = 20,198 ICU events) as input into the models, from data available for mechanically ventilated patients in the Medical Information Mart for Intensive Care (MIMIC) III database. As a regression task, the machine learning models were used to impute PaO2 values. As a classification task, the models were used to predict patients with moderate-to-severe hypoxemic respiratory failure based on a clinically relevant cut-off of PaO2/FIO2 ≤ 150. The accuracy of the machine learning models was compared to published log-linear and non-linear equations. An online imputation calculator was created. Results: Compared to seven features, three features (SpO2, FIO2 and PEEP) were sufficient to impute the PaO2/FIO2 ratio using a large dataset. All of the tested machine learning models imputed PaO2/FIO2 from SpO2/FIO2 with lower error and had greater accuracy in predicting PaO2/FIO2 ≤ 150 compared to published equations. Using three features, the machine learning models showed superior performance in imputing PaO2 across the entire span of SpO2 values, including those ≥ 97%. Interpretation: The improved performance shown for the machine learning algorithms suggests a promising framework for future use in large datasets.
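
A hedged sketch of the two tasks described above, on synthetic data rather than MIMIC-III: a gradient-boosted regressor imputes PaO2 from the three features found sufficient (SpO2, FIO2, PEEP), the imputed values are thresholded at PaO2/FIO2 ≤ 150, and an illustrative log-linear baseline of the general form log(P/F) = a + b * log(S/F) is included for comparison (the coefficients are placeholders, not a specific published fit).

```python
# Sketch under stated assumptions: imputing PaO2 from SpO2, FiO2 and PEEP on
# synthetic data, then classifying moderate-to-severe hypoxemia (P/F <= 150).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5000
spo2 = rng.uniform(85, 100, n)
fio2 = rng.uniform(0.3, 1.0, n)
peep = rng.uniform(5, 18, n)
pao2 = 0.9 * spo2 + 40 * fio2 - 1.5 * peep + rng.normal(0, 8, n)   # synthetic relationship

X = np.column_stack([spo2, fio2, peep])
X_tr, X_te, y_tr, y_te = train_test_split(X, pao2, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("imputation MAE (mmHg):", mean_absolute_error(y_te, pred))

# Classification task: moderate-to-severe hypoxemia at PaO2/FiO2 <= 150
pf_true, pf_pred = y_te / X_te[:, 1], pred / X_te[:, 1]
print("ML accuracy at P/F <= 150:", accuracy_score(pf_true <= 150, pf_pred <= 150))

# Log-linear style baseline with illustrative (placeholder) coefficients
sf = X_te[:, 0] / X_te[:, 1]
pf_loglin = np.exp(0.48 + 0.78 * np.log(sf))
print("baseline accuracy:", accuracy_score(pf_true <= 150, pf_loglin <= 150))
```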


In a large distributed virtualized environment, predicting the alerting source from its text seems to be a daunting task. This paper explores the option of using machine learning algorithms to solve this problem. Unfortunately, our training dataset is highly imbalanced: 96% of alerting data is reported by 24% of alerting sources. This is the expected dataset in any live distributed virtualized environment, where newer device versions will have relatively fewer alerts compared to older devices. Any classification effort with such an imbalanced dataset presents a different set of challenges compared to binary classification. This type of skewed data distribution makes conventional machine learning less effective, especially when predicting the minority device-type alerts. Our challenge is to build a robust model which can cope with this imbalanced dataset and achieve a relatively high level of prediction accuracy. This research work started with traditional regression and classification algorithms using a bag-of-words model. Then word2vec and doc2vec models were used to represent the words in vector format, which preserves the semantic meaning of the sentence. With this, alerting texts with similar messages have the same vector-form representation. This vectorized alerting text was used with logistic regression for model building. This yielded better accuracy, but the model was relatively complex and demanded more computational resources. Finally, a simple neural network was used for this multi-class text classification problem domain using the Keras and TensorFlow libraries. A simple two-layered neural network yielded 99% accuracy, even though our training dataset was not balanced. This paper goes through the qualitative evaluation of the different machine learning algorithms and their respective results. Finally, the two-layered deep learning algorithm is selected as the final solution, since it takes relatively fewer resources and less time while achieving better accuracy.
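
A minimal sketch of the final approach described above: alert text is vectorized (TF-IDF here, standing in for the word2vec/doc2vec step) and fed to a simple two-layer network built with Keras/TensorFlow. The sample alerts and device-type labels are fabricated for illustration.

```python
# Two-layer neural network for multi-class alert-text classification (sketch).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

alerts = ["disk latency high on datastore", "vm cpu usage exceeded threshold",
          "nic link down on host", "datastore usage above limit"] * 50
labels = np.array([0, 1, 2, 0] * 50)                 # device-type classes (illustrative)

vec = TfidfVectorizer(max_features=2000)
X = vec.fit_transform(alerts).toarray().astype("float32")
n_classes = len(set(labels.tolist()))

model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),       # hidden layer
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# class_weight could be supplied to fit() to soften the 96%/24% imbalance noted above
model.fit(X, labels, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, labels, verbose=0))          # [loss, accuracy]
```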


Author(s):  
V. P. Yadav ◽  
R. Prasad ◽  
R. Bala ◽  
A. K. Vishwakarma ◽  
S. A. Yadav ◽  
...  

Abstract. The leaf area index (LAI) is one of the key crop variables, playing an important role in agriculture, ecology and climate change studies, where global circulation models use it to compute energy and water fluxes. In recent research, machine-learning algorithms have provided accurate computational approaches for the estimation of crop biophysical parameters using remotely sensed data. Three machine-learning algorithms, random forest regression (RFR), support vector regression (SVR) and artificial neural network regression (ANNR), were used to estimate crop LAI in the present study. Landsat-8 satellite images from three different dates between January 2017 and March 2017 were used, covering different crop growth conditions in Varanasi district, India. The sampling regions were fully covered by major Rabi season crops such as wheat, barley and mustard. Of the total pooled data, 60% of the samples were used for training the algorithms and the remaining 40% for testing and validation of the machine-learning regression algorithms. The highest sensitivity of the normalized difference vegetation index (NDVI) to LAI was found using the RFR algorithm (R2 = 0.884, RMSE = 0.404) as compared to SVR (R2 = 0.847, RMSE = 0.478) and ANNR (R2 = 0.829, RMSE = 0.404). Therefore, the RFR algorithm can be used for accurate estimation of crop LAI from satellite data.
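
The NDVI-to-LAI regression with a 60/40 train/test split can be sketched as follows; the data are synthetic stand-ins for the study's Landsat-8 samples, so the R2 and RMSE values will not match those reported above.

```python
# Random forest regression of LAI from NDVI with a 60/40 split (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(42)
ndvi = rng.uniform(0.2, 0.9, 300)
lai = 6.0 * ndvi ** 1.5 + rng.normal(0, 0.3, 300)    # illustrative NDVI-LAI relation

X_tr, X_te, y_tr, y_te = train_test_split(ndvi.reshape(-1, 1), lai,
                                          test_size=0.4, random_state=0)
rfr = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = rfr.predict(X_te)
print("R2:", r2_score(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
```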


2021 ◽  
Vol 297 ◽  
pp. 01029
Author(s):  
Mohammed Azza ◽  
Jabran Daaif ◽  
Adnane Aouidate ◽  
El Hadi Chahid ◽  
Said Belaaouad

In this paper, we discuss the prediction of the photo-current generated by a solar cell using machine learning algorithms. For the selection of prediction methods, we compared and explored different approaches, using precision, MSE and MAE as criteria for model selection. This study uses machine learning algorithms as a research method to develop models for predicting the solar cell photo-current. We create an electric current prediction model from machine learning algorithms such as linear regression, Lasso regression, K Nearest Neighbors, decision tree and random forest, and compare their prediction accuracy. On this basis, we recommend a solar cell photo-current prediction model for better information based on resistance assessment. These reviews show that the linear regression algorithm, given its precision, reliably outperforms the alternative models in predicting the solar cell photo-current Iph.
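
An illustrative version of the comparison described above, on synthetic data; the feature set and the relationship used to generate Iph are assumptions for the sketch, not the study's measurements.

```python
# Comparing several regressors with MSE and MAE for photo-current prediction.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(7)
irradiance = rng.uniform(200, 1000, 400)             # W/m^2 (synthetic)
temperature = rng.uniform(15, 60, 400)               # deg C (synthetic)
iph = 0.008 * irradiance + 0.01 * temperature + rng.normal(0, 0.1, 400)  # amps, illustrative

X = np.column_stack([irradiance, temperature])
X_tr, X_te, y_tr, y_te = train_test_split(X, iph, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "lasso": Lasso(alpha=0.01),
    "kNN": KNeighborsRegressor(n_neighbors=5),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: MSE={mean_squared_error(y_te, pred):.4f} "
          f"MAE={mean_absolute_error(y_te, pred):.4f}")
```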


2020 ◽  
Vol 12 (11) ◽  
pp. 1838 ◽  
Author(s):  
Zhao Zhang ◽  
Paulo Flores ◽  
C. Igathinathane ◽  
Dayakar L. Naik ◽  
Ravi Kiran ◽  
...  

The current mainstream approach of using manual measurements and visual inspections for crop lodging detection is inefficient, time-consuming, and subjective. An innovative method for wheat lodging detection that can overcome or alleviate these shortcomings would be welcomed. This study proposed a systematic approach for wheat lodging detection in research plots (372 experimental plots), which consisted of using unmanned aerial systems (UAS) for aerial imagery acquisition, manual field evaluation, and machine learning algorithms to detect whether or not lodging occurred. UAS imagery was collected on three different dates (23 and 30 July 2019, and 8 August 2019) after lodging occurred. Traditional machine learning and deep learning were evaluated and compared in this study in terms of classification accuracy and standard deviation. For traditional machine learning, five types of features (i.e., gray level co-occurrence matrix, local binary pattern, Gabor, intensity, and Hu-moment) were extracted and fed into three traditional machine learning algorithms (i.e., random forest (RF), neural network, and support vector machine) for detecting lodged plots. For the datasets on each imagery collection date, the accuracies of the three algorithms were not significantly different from each other. For any of the three algorithms, accuracies on the first and last date datasets had the lowest and highest values, respectively. Incorporating standard deviation as a measurement of performance robustness, RF was determined to be the most satisfactory. Regarding deep learning, three different convolutional neural networks (a simple convolutional neural network, VGG-16, and GoogLeNet) were tested. For any of the single-date datasets, GoogLeNet consistently had superior performance over the other two methods. Further comparisons between RF and GoogLeNet demonstrated that the detection accuracies of the two methods were not significantly different from each other (p > 0.05); hence, the choice of either would not affect the final detection accuracies. However, considering that the average accuracy of GoogLeNet (93%) was higher than that of RF (91%), it was recommended to use GoogLeNet for wheat lodging detection. This research demonstrated that UAS RGB imagery, coupled with the GoogLeNet machine learning algorithm, can be a novel, reliable, objective, simple, low-cost, and effective (accuracy > 90%) tool for wheat lodging detection.
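
The traditional-ML branch described above can be sketched roughly as follows: GLCM texture features are extracted per plot image with scikit-image and fed to a random forest classifier. The images and lodging labels below are random stand-ins for the UAS plot imagery, so the accuracies are meaningless; only the shape of the pipeline is illustrated.

```python
# GLCM texture features per plot image, classified with a random forest.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

def glcm_features(img):
    """Contrast / homogeneity / energy / correlation from a grayscale plot image."""
    glcm = graycomatrix(img, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    return [graycoprops(glcm, p)[0, 0]
            for p in ("contrast", "homogeneity", "energy", "correlation")]

# 372 plots, labeled lodged / not lodged (images and labels are synthetic here)
images = rng.integers(0, 256, size=(372, 64, 64), dtype=np.uint8)
labels = rng.integers(0, 2, size=372)

X = np.array([glcm_features(img) for img in images])
rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, labels, cv=5).mean())
```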


2020 ◽  
Vol 12 (21) ◽  
pp. 3609
Author(s):  
Xinchuan Li ◽  
Juhua Luo ◽  
Xiuliang Jin ◽  
Qiaoning He ◽  
Yun Niu

Spatially continuous soil thickness data at large scales are usually not readily available and are often difficult and expensive to acquire. Various machine learning algorithms have become very popular in digital soil mapping to predict and map the spatial distribution of soil properties. Identifying the controlling environmental variables of soil thickness and selecting suitable machine learning algorithms are vitally important in modeling. In this study, 11 quantitative and four qualitative environmental variables were selected to explore the main variables that affect soil thickness. Four commonly used machine learning algorithms (multiple linear regression (MLR), support vector regression (SVR), random forest (RF), and extreme gradient boosting (XGBoost)) were evaluated as individual models to separately predict and obtain a soil thickness distribution map in Henan Province, China. In addition, two stacking ensemble models using the least absolute shrinkage and selection operator (LASSO) and the generalized boosted regression model (GBM) were tested and applied to build the most reliable and accurate estimation model. The results showed that variable selection was a very important part of soil thickness modeling. Topographic wetness index (TWI), slope, elevation, land use and enhanced vegetation index (EVI) were the most influential environmental variables in soil thickness modeling. Comparative results showed that the XGBoost model outperformed the MLR, RF and SVR models. Importantly, the two stacking models achieved higher performance than the single models, especially when using GBM. In terms of accuracy, the proposed stacking method explained 64.0% of the variation in soil thickness. The results of our study provide useful alternative approaches for mapping soil thickness, with potential for use with other soil properties.
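
A rough sketch of the stacking idea on synthetic data: four base regressors feed a gradient-boosting meta-learner, mirroring the GBM stacking that performed best above. scikit-learn's GradientBoostingRegressor stands in for XGBoost to keep the example dependency-free, and the five features only loosely echo TWI, slope, elevation, land use and EVI.

```python
# Stacking ensemble with a gradient-boosting meta-learner (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
n = 600
X = rng.normal(size=(n, 5))                 # stand-ins for TWI, slope, elevation, land use, EVI
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(0, 0.5, n)

stack = StackingRegressor(
    estimators=[
        ("mlr", LinearRegression()),
        ("svr", SVR(C=10.0)),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),   # stand-in for XGBoost
    ],
    final_estimator=GradientBoostingRegressor(random_state=0),  # GBM-style meta-learner
)
print("stacked R2:", cross_val_score(stack, X, y, cv=5, scoring="r2").mean())
```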


2019 ◽  
Vol 27 (1) ◽  
pp. 13-21 ◽  
Author(s):  
Qiang Wei ◽  
Zongcheng Ji ◽  
Zhiheng Li ◽  
Jingcheng Du ◽  
Jingqi Wang ◽  
...  

Abstract Objective: This article presents our approaches to extraction of medications and associated adverse drug events (ADEs) from clinical documents, which is the second track of the 2018 National NLP Clinical Challenges (n2c2) shared task. Materials and Methods: The clinical corpus used in this study was from the MIMIC-III database, and the organizers annotated 303 documents for training and 202 for testing. Our system consists of 2 components: a named entity recognition (NER) component and a relation classification (RC) component. For each component, we implemented deep learning-based approaches (e.g., Bi-LSTM-CRF) and compared them with traditional machine learning approaches, namely, conditional random fields for NER and support vector machines for RC, respectively. In addition, we developed a deep learning-based joint model that recognizes ADEs and their relations to medications in 1 step using a sequence labeling approach. To further improve the performance, we also investigated different ensemble approaches to generating optimal performance by combining outputs from multiple approaches. Results: Our best-performing systems achieved F1 scores of 93.45% for NER, 96.30% for RC, and 89.05% for end-to-end evaluation, which ranked #2, #1, and #1 among all participants, respectively. Additional evaluations show that the deep learning-based approaches did outperform traditional machine learning algorithms in both NER and RC. The joint model that simultaneously recognizes ADEs and their relations to medications also achieved the best performance on RC, indicating its promise for relation extraction. Conclusion: In this study, we developed deep learning approaches for extracting medications and their attributes such as ADEs, and demonstrated their superior performance compared with traditional machine learning algorithms, indicating their utility in broader NER and RC tasks in the medical domain.
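
For orientation, here is a tiny sketch of the traditional SVM baseline for the relation classification (RC) component, assuming the text between a drug mention and a candidate ADE mention is used as input; the snippets and labels are fabricated, not MIMIC-III text, and this is not the authors' pipeline.

```python
# TF-IDF + linear SVM baseline for drug-ADE relation classification (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

between_texts = [
    "was started and the patient developed",
    "discontinued due to",
    "was given with no change in",
    "dose held after onset of",
] * 25
labels = [1, 1, 0, 1] * 25                   # 1 = ADE-Drug relation, 0 = no relation

rc = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=1.0))
print("CV F1:", cross_val_score(rc, between_texts, labels, cv=5, scoring="f1").mean())
```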

