Accurate estimation of isoelectric point of protein and peptide based on amino acid sequences

2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system (for example protein, DNA or RNA), the isoelectric point, commonly referred to as the pI, can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge, and thus the electrophoretic mobility, of the ampholyte sums to zero. Several modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analyses. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance depends strongly on the quality of that data. In contrast to iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction. Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
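
The iterative approach benchmarked above reduces to a simple root-finding problem: evaluate the Henderson-Hasselbalch net charge of the sequence at a trial pH and bisect until the charge crosses zero. The sketch below illustrates this in Python with approximate, EMBOSS-like pKa values; the exact values differ between basis sets, which is precisely the sensitivity the paper measures, and this is not the pIR implementation.

```python
# Minimal sketch of the classic iterative (bisection) pI calculation the paper
# benchmarks against learning-based methods. The pKa values are approximate
# (EMBOSS-like) and purely illustrative: the choice of pKa "basis set" is
# exactly what the benchmark shows predictions are sensitive to.

PKA_POS = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}            # protonatable groups
PKA_NEG = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # deprotonatable groups

def net_charge(seq: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch form)."""
    counts = {aa: seq.count(aa) for aa in "KRHDECY"}
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))            # N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))           # C-terminus
    for aa in "KRH":
        charge += counts[aa] / (1.0 + 10 ** (ph - PKA_POS[aa]))
    for aa in "DECY":
        charge -= counts[aa] / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisection search for the pH at which the net charge crosses zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid            # still positively charged, so the pI is higher
        else:
            hi = mid
    return round((lo + hi) / 2.0, 2)

print(isoelectric_point("PEPTIDE"))   # acidic peptide, pI well below 7
```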

Cancers ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 3817
Author(s):  
Shi-Jer Lou ◽  
Ming-Feng Hou ◽  
Hong-Tai Chang ◽  
Chong-Chi Chiu ◽  
Hao-Hsien Lee ◽  
...  

No studies have discussed machine learning algorithms to predict recurrence within 10 years after breast cancer surgery. This study aimed to compare the accuracy of forecasting models for predicting recurrence within 10 years after breast cancer surgery and to identify significant predictors of recurrence. Registry data for breast cancer surgery patients were allocated to a training dataset (n = 798) for model development, a testing dataset (n = 171) for internal validation, and a validating dataset (n = 171) for external validation. Global sensitivity analysis was then performed to evaluate the significance of the selected predictors. Demographic characteristics, clinical characteristics, quality of care, and preoperative quality of life were significantly associated with recurrence within 10 years after breast cancer surgery (p < 0.05). Artificial neural networks had the highest prediction performance indices. Additionally, surgeon volume was the best predictor of recurrence within 10 years after breast cancer surgery, followed by hospital volume and tumor stage. Accurate prediction of recurrence within 10 years by machine learning algorithms may improve precision in managing patients after breast cancer surgery and improve understanding of the risk factors for such recurrence.
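
As a rough illustration of the modeling setup described above (not the study's code or registry data), the sketch below trains an MLP classifier on synthetic stand-in features with the same 798/171/171 split sizes and uses permutation importance as a simple proxy for the global sensitivity analysis.

```python
# Hedged sketch: MLP classifier with train / internal-test / external-validation
# splits matching the sizes reported above, on synthetic stand-in features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 798 + 171 + 171, 8                    # 798 train, 171 test, 171 validate
X = rng.normal(size=(n, p))                  # stand-ins for surgeon volume, hospital volume, stage, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n) > 0).astype(int)

X_tr, y_tr = X[:798], y[:798]
X_te, y_te = X[798:969], y[798:969]
X_va, y_va = X[969:], y[969:]

ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_tr, y_tr)
print("internal AUC:", roc_auc_score(y_te, ann.predict_proba(X_te)[:, 1]))
print("external AUC:", roc_auc_score(y_va, ann.predict_proba(X_va)[:, 1]))

# Permutation importance as a simple stand-in for global sensitivity analysis
imp = permutation_importance(ann, X_va, y_va, n_repeats=20, random_state=0)
print("predictor ranking (most to least important):", np.argsort(imp.importances_mean)[::-1])
```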


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Nalindren Naicker ◽  
Timothy Adeliyi ◽  
Jeanette Wing

Educational Data Mining (EDM) is a rich research field in computer science. Tools and techniques in EDM are useful for predicting student performance, which gives practitioners useful insights to develop appropriate intervention strategies to improve pass rates and increase retention. The performance of state-of-the-art machine learning classifiers is very much dependent on the task at hand. Support vector machines have been used extensively in classification problems; however, the extant literature shows a gap in the application of linear support vector machines as a predictor of student performance. The aim of this study was to compare the performance of linear support vector machines with that of state-of-the-art classical machine learning algorithms in order to determine the algorithm that would improve prediction of student performance. In this quantitative study, an experimental research design was used. Experiments were set up using feature selection on a publicly available dataset of 1000 alpha-numeric student records. Linear support vector machines, benchmarked against ten categorical machine learning algorithms, showed superior performance in predicting student performance. The results of this research showed that features such as race, gender, and lunch influence performance in mathematics, whilst access to lunch was the primary factor influencing reading and writing performance.
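
A minimal sketch of this kind of benchmark follows, assuming a students-performance style CSV with the column names shown; the file name, columns and pass threshold are assumptions for illustration, not details from the study.

```python
# Illustrative benchmark of a linear SVM against two classical classifiers on a
# students-performance style dataset (file name, columns and threshold assumed).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("StudentsPerformance.csv")          # hypothetical 1000-record file
features = ["gender", "race/ethnicity", "lunch",
            "parental level of education", "test preparation course"]
X = df[features]
y = (df["math score"] >= 50).astype(int)             # assumed pass threshold

encode = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), features)])

models = {
    "linear SVM": LinearSVC(C=1.0, max_iter=5000),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in models.items():
    score = cross_val_score(make_pipeline(encode, clf), X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {score:.3f}")
```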


2020 ◽  
Vol 1500 ◽  
pp. 012131
Author(s):  
Firdaus ◽  
Andre Herviant Juliano ◽  
Naufal Rachmatullah ◽  
Sarifah Putri Rafflesia ◽  
Dinna Yunika Hardiyanti ◽  
...  

Author(s):  
Shuangxia Ren ◽  
Jill Zupetic ◽  
Mehdi Nouraie ◽  
Xinghua Lu ◽  
Richard D. Boyce ◽  
...  

Abstract Background: The partial pressure of oxygen (PaO2)/fraction of oxygen delivered (FIO2) ratio is the reference standard for assessment of hypoxemia in mechanically ventilated patients. Non-invasive monitoring with the peripheral saturation of oxygen (SpO2) is increasingly utilized to estimate PaO2 because it does not require invasive sampling. Several equations have been reported to impute PaO2/FIO2 from SpO2/FIO2. However, machine-learning algorithms to impute PaO2 from SpO2 have not been compared to published equations. Research Question: How do machine learning algorithms perform at predicting PaO2 from SpO2 compared to previously published equations? Methods: Three machine learning algorithms (neural network, regression, and kernel-based methods) were developed using 7 clinical variable features (n = 9,900 ICU events) and subsequently 3 features (n = 20,198 ICU events) as input into the models, from data available for mechanically ventilated patients in the Medical Information Mart for Intensive Care (MIMIC) III database. As a regression task, the machine learning models were used to impute PaO2 values. As a classification task, the models were used to predict patients with moderate-to-severe hypoxemic respiratory failure based on a clinically relevant cut-off of PaO2/FIO2 ≤ 150. The accuracy of the machine learning models was compared to published log-linear and non-linear equations. An online imputation calculator was created. Results: Compared to seven features, three features (SpO2, FIO2 and PEEP) were sufficient to impute the PaO2/FIO2 ratio using a large dataset. All of the tested machine learning models imputed PaO2/FIO2 from SpO2/FIO2 with lower error and had greater accuracy in predicting PaO2/FIO2 ≤ 150 compared to published equations. Using three features, the machine learning models showed superior performance in imputing PaO2 across the entire span of SpO2 values, including those ≥ 97%. Interpretation: The improved performance shown for the machine learning algorithms suggests a promising framework for future use in large datasets.
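
A hedged sketch of the two tasks described above, on synthetic data rather than MIMIC-III: a gradient-boosted regressor imputes PaO2 from the three features found sufficient (SpO2, FIO2, PEEP), the imputed values are thresholded at PaO2/FIO2 ≤ 150, and an illustrative log-linear baseline of the general form log(P/F) = a + b * log(S/F) is included for comparison (the coefficients are placeholders, not a specific published fit).

```python
# Sketch under stated assumptions: imputing PaO2 from SpO2, FiO2 and PEEP on
# synthetic data, then classifying moderate-to-severe hypoxemia (P/F <= 150).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5000
spo2 = rng.uniform(85, 100, n)
fio2 = rng.uniform(0.3, 1.0, n)
peep = rng.uniform(5, 18, n)
pao2 = 0.9 * spo2 + 40 * fio2 - 1.5 * peep + rng.normal(0, 8, n)   # synthetic relationship

X = np.column_stack([spo2, fio2, peep])
X_tr, X_te, y_tr, y_te = train_test_split(X, pao2, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("imputation MAE (mmHg):", mean_absolute_error(y_te, pred))

# Classification task: moderate-to-severe hypoxemia at PaO2/FiO2 <= 150
pf_true, pf_pred = y_te / X_te[:, 1], pred / X_te[:, 1]
print("ML accuracy at P/F <= 150:", accuracy_score(pf_true <= 150, pf_pred <= 150))

# Log-linear style baseline with illustrative (placeholder) coefficients
sf = X_te[:, 0] / X_te[:, 1]
pf_loglin = np.exp(0.48 + 0.78 * np.log(sf))
print("baseline accuracy:", accuracy_score(pf_true <= 150, pf_loglin <= 150))
```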


In a large distributed virtualized environment, predicting the alerting source from its text seems to be a daunting task. This paper explores the option of using machine learning algorithms to solve this problem. Unfortunately, our training dataset is highly imbalanced: 96% of alerting data is reported by 24% of alerting sources. This is the expected dataset in any live distributed virtualized environment, where newer device versions will have relatively fewer alerts compared to older devices. Any classification effort with such an imbalanced dataset presents a different set of challenges compared to binary classification. This type of skewed data distribution makes conventional machine learning less effective, especially when predicting the minority device-type alerts. Our challenge is to build a robust model which can cope with this imbalanced dataset and achieve a relatively high level of prediction accuracy. This research work started with traditional regression and classification algorithms using a bag-of-words model. Then word2vec and doc2vec models were used to represent the words in vector format, which preserves the semantic meaning of the sentence. With this, alerting texts with similar messages have the same vector-form representation. This vectorized alerting text was used with logistic regression for model building. This yielded better accuracy, but the model was relatively complex and demanded more computational resources. Finally, a simple neural network was used for this multi-class text classification problem domain using the Keras and TensorFlow libraries. A simple two-layered neural network yielded 99% accuracy, even though our training dataset was not balanced. This paper goes through the qualitative evaluation of the different machine learning algorithms and their respective results. Finally, the two-layered deep learning algorithm is selected as the final solution, since it takes relatively fewer resources and less time while achieving better accuracy.
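
A minimal sketch of the final approach described above: alert text is vectorized (TF-IDF here, standing in for the word2vec/doc2vec step) and fed to a simple two-layer network built with Keras/TensorFlow. The sample alerts and device-type labels are fabricated for illustration.

```python
# Two-layer neural network for multi-class alert-text classification (sketch).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

alerts = ["disk latency high on datastore", "vm cpu usage exceeded threshold",
          "nic link down on host", "datastore usage above limit"] * 50
labels = np.array([0, 1, 2, 0] * 50)                 # device-type classes (illustrative)

vec = TfidfVectorizer(max_features=2000)
X = vec.fit_transform(alerts).toarray().astype("float32")
n_classes = len(set(labels.tolist()))

model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),       # hidden layer
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# class_weight could be supplied to fit() to soften the 96%/24% imbalance noted above
model.fit(X, labels, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, labels, verbose=0))          # [loss, accuracy]
```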


Author(s):  
V. P. Yadav ◽  
R. Prasad ◽  
R. Bala ◽  
A. K. Vishwakarma ◽  
S. A. Yadav ◽  
...  

Abstract. The leaf area index (LAI) is one of the key crop variables, playing an important role in agriculture, ecology and climate change studies, where global circulation models use it to compute energy and water fluxes. In recent research, machine-learning algorithms have provided accurate computational approaches for the estimation of crop biophysical parameters using remotely sensed data. Three machine-learning algorithms, random forest regression (RFR), support vector regression (SVR) and artificial neural network regression (ANNR), were used to estimate crop LAI in the present study. Landsat-8 satellite images from three different dates between January 2017 and March 2017 were used, covering different crop growth conditions in Varanasi district, India. The sampling regions were fully covered by major Rabi season crops such as wheat, barley and mustard. Of the total pooled data, 60% of the samples were used for training the algorithms and the remaining 40% for testing and validation of the machine-learning regression algorithms. The highest sensitivity of the normalized difference vegetation index (NDVI) to LAI was found using the RFR algorithm (R2 = 0.884, RMSE = 0.404) as compared to SVR (R2 = 0.847, RMSE = 0.478) and ANNR (R2 = 0.829, RMSE = 0.404). Therefore, the RFR algorithm can be used for accurate estimation of crop LAI from satellite data.
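
The NDVI-to-LAI regression with a 60/40 train/test split can be sketched as follows; the data are synthetic stand-ins for the study's Landsat-8 samples, so the R2 and RMSE values will not match those reported above.

```python
# Random forest regression of LAI from NDVI with a 60/40 split (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(42)
ndvi = rng.uniform(0.2, 0.9, 300)
lai = 6.0 * ndvi ** 1.5 + rng.normal(0, 0.3, 300)    # illustrative NDVI-LAI relation

X_tr, X_te, y_tr, y_te = train_test_split(ndvi.reshape(-1, 1), lai,
                                          test_size=0.4, random_state=0)
rfr = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = rfr.predict(X_te)
print("R2:", r2_score(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
```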


2021 ◽  
Vol 297 ◽  
pp. 01029
Author(s):  
Mohammed Azza ◽  
Jabran Daaif ◽  
Adnane Aouidate ◽  
El Hadi Chahid ◽  
Said Belaaouad

In this paper, we discuss the prediction of the photo-current generated by a solar cell using machine learning algorithms. For the selection of prediction methods, we compared and explored different approaches, using precision, MSE and MAE as criteria for model selection. This study uses machine learning algorithms as a research method to develop models for predicting the solar cell photo-current. We create an electric current prediction model from machine learning algorithms such as linear regression, Lasso regression, K Nearest Neighbors, decision tree and random forest, and compare their prediction accuracy. On this basis, we recommend a solar cell photo-current prediction model for better information based on resistance assessment. These reviews show that the linear regression algorithm, given its precision, reliably outperforms the alternative models in predicting the solar cell photo-current Iph.
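
An illustrative version of the comparison described above, on synthetic data; the feature set and the relationship used to generate Iph are assumptions for the sketch, not the study's measurements.

```python
# Comparing several regressors with MSE and MAE for photo-current prediction.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(7)
irradiance = rng.uniform(200, 1000, 400)             # W/m^2 (synthetic)
temperature = rng.uniform(15, 60, 400)               # deg C (synthetic)
iph = 0.008 * irradiance + 0.01 * temperature + rng.normal(0, 0.1, 400)  # amps, illustrative

X = np.column_stack([irradiance, temperature])
X_tr, X_te, y_tr, y_te = train_test_split(X, iph, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "lasso": Lasso(alpha=0.01),
    "kNN": KNeighborsRegressor(n_neighbors=5),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: MSE={mean_squared_error(y_te, pred):.4f} "
          f"MAE={mean_absolute_error(y_te, pred):.4f}")
```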


2020 ◽  
Vol 12 (11) ◽  
pp. 1838 ◽  
Author(s):  
Zhao Zhang ◽  
Paulo Flores ◽  
C. Igathinathane ◽  
Dayakar L. Naik ◽  
Ravi Kiran ◽  
...  

The current mainstream approach of using manual measurements and visual inspections for crop lodging detection is inefficient, time-consuming, and subjective. An innovative method for wheat lodging detection that can overcome or alleviate these shortcomings would be welcomed. This study proposed a systematic approach for wheat lodging detection in research plots (372 experimental plots), which consisted of using unmanned aerial systems (UAS) for aerial imagery acquisition, manual field evaluation, and machine learning algorithms to detect whether or not lodging occurred. UAS imagery was collected on three different dates (23 and 30 July 2019, and 8 August 2019) after lodging occurred. Traditional machine learning and deep learning were evaluated and compared in this study in terms of classification accuracy and standard deviation. For traditional machine learning, five types of features (i.e., gray level co-occurrence matrix, local binary pattern, Gabor, intensity, and Hu-moment) were extracted and fed into three traditional machine learning algorithms (i.e., random forest (RF), neural network, and support vector machine) for detecting lodged plots. For the datasets on each imagery collection date, the accuracies of the three algorithms were not significantly different from each other. For any of the three algorithms, accuracies on the first and last date datasets had the lowest and highest values, respectively. Incorporating standard deviation as a measurement of performance robustness, RF was determined to be the most satisfactory. Regarding deep learning, three different convolutional neural networks (a simple convolutional neural network, VGG-16, and GoogLeNet) were tested. For any of the single-date datasets, GoogLeNet consistently had superior performance over the other two methods. Further comparisons between RF and GoogLeNet demonstrated that the detection accuracies of the two methods were not significantly different from each other (p > 0.05); hence, the choice of either would not affect the final detection accuracies. However, considering that the average accuracy of GoogLeNet (93%) was higher than that of RF (91%), it was recommended to use GoogLeNet for wheat lodging detection. This research demonstrated that UAS RGB imagery, coupled with the GoogLeNet machine learning algorithm, can be a novel, reliable, objective, simple, low-cost, and effective (accuracy > 90%) tool for wheat lodging detection.
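
The traditional-ML branch described above can be sketched roughly as follows: GLCM texture features are extracted per plot image with scikit-image and fed to a random forest classifier. The images and lodging labels below are random stand-ins for the UAS plot imagery, so the accuracies are meaningless; only the shape of the pipeline is illustrated.

```python
# GLCM texture features per plot image, classified with a random forest.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

def glcm_features(img):
    """Contrast / homogeneity / energy / correlation from a grayscale plot image."""
    glcm = graycomatrix(img, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    return [graycoprops(glcm, p)[0, 0]
            for p in ("contrast", "homogeneity", "energy", "correlation")]

# 372 plots, labeled lodged / not lodged (images and labels are synthetic here)
images = rng.integers(0, 256, size=(372, 64, 64), dtype=np.uint8)
labels = rng.integers(0, 2, size=372)

X = np.array([glcm_features(img) for img in images])
rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, labels, cv=5).mean())
```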


2020 ◽  
Vol 12 (21) ◽  
pp. 3609
Author(s):  
Xinchuan Li ◽  
Juhua Luo ◽  
Xiuliang Jin ◽  
Qiaoning He ◽  
Yun Niu

Spatially continuous soil thickness data at large scales are usually not readily available and are often difficult and expensive to acquire. Various machine learning algorithms have become very popular in digital soil mapping to predict and map the spatial distribution of soil properties. Identifying the controlling environmental variables of soil thickness and selecting suitable machine learning algorithms are vitally important in modeling. In this study, 11 quantitative and four qualitative environmental variables were selected to explore the main variables that affect soil thickness. Four commonly used machine learning algorithms (multiple linear regression (MLR), support vector regression (SVR), random forest (RF), and extreme gradient boosting (XGBoost)) were evaluated as individual models to separately predict and obtain a soil thickness distribution map in Henan Province, China. In addition, two stacking ensemble models using the least absolute shrinkage and selection operator (LASSO) and the generalized boosted regression model (GBM) were tested and applied to build the most reliable and accurate estimation model. The results showed that variable selection was a very important part of soil thickness modeling. Topographic wetness index (TWI), slope, elevation, land use and enhanced vegetation index (EVI) were the most influential environmental variables in soil thickness modeling. Comparative results showed that the XGBoost model outperformed the MLR, RF and SVR models. Importantly, the two stacking models achieved higher performance than the single models, especially when using GBM. In terms of accuracy, the proposed stacking method explained 64.0% of the variation in soil thickness. The results of our study provide useful alternative approaches for mapping soil thickness, with potential for use with other soil properties.
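
A rough sketch of the stacking idea on synthetic data: four base regressors feed a gradient-boosting meta-learner, mirroring the GBM stacking that performed best above. scikit-learn's GradientBoostingRegressor stands in for XGBoost to keep the example dependency-free, and the five features only loosely echo TWI, slope, elevation, land use and EVI.

```python
# Stacking ensemble with a gradient-boosting meta-learner (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
n = 600
X = rng.normal(size=(n, 5))                 # stand-ins for TWI, slope, elevation, land use, EVI
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(0, 0.5, n)

stack = StackingRegressor(
    estimators=[
        ("mlr", LinearRegression()),
        ("svr", SVR(C=10.0)),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),   # stand-in for XGBoost
    ],
    final_estimator=GradientBoostingRegressor(random_state=0),  # GBM-style meta-learner
)
print("stacked R2:", cross_val_score(stack, X, y, cv=5, scoring="r2").mean())
```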


2019 ◽  
Vol 27 (1) ◽  
pp. 13-21 ◽  
Author(s):  
Qiang Wei ◽  
Zongcheng Ji ◽  
Zhiheng Li ◽  
Jingcheng Du ◽  
Jingqi Wang ◽  
...  

Abstract Objective: This article presents our approaches to extraction of medications and associated adverse drug events (ADEs) from clinical documents, which is the second track of the 2018 National NLP Clinical Challenges (n2c2) shared task. Materials and Methods: The clinical corpus used in this study was from the MIMIC-III database, and the organizers annotated 303 documents for training and 202 for testing. Our system consists of 2 components: a named entity recognition (NER) component and a relation classification (RC) component. For each component, we implemented deep learning-based approaches (e.g., Bi-LSTM-CRF) and compared them with traditional machine learning approaches, namely, conditional random fields for NER and support vector machines for RC, respectively. In addition, we developed a deep learning-based joint model that recognizes ADEs and their relations to medications in 1 step using a sequence labeling approach. To further improve the performance, we also investigated different ensemble approaches to generating optimal performance by combining outputs from multiple approaches. Results: Our best-performing systems achieved F1 scores of 93.45% for NER, 96.30% for RC, and 89.05% for end-to-end evaluation, which ranked #2, #1, and #1 among all participants, respectively. Additional evaluations show that the deep learning-based approaches did outperform traditional machine learning algorithms in both NER and RC. The joint model that simultaneously recognizes ADEs and their relations to medications also achieved the best performance on RC, indicating its promise for relation extraction. Conclusion: In this study, we developed deep learning approaches for extracting medications and their attributes such as ADEs, and demonstrated their superior performance compared with traditional machine learning algorithms, indicating their utility in broader NER and RC tasks in the medical domain.
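
For orientation, here is a tiny sketch of the traditional SVM baseline for the relation classification (RC) component, assuming the text between a drug mention and a candidate ADE mention is used as input; the snippets and labels are fabricated, not MIMIC-III text, and this is not the authors' pipeline.

```python
# TF-IDF + linear SVM baseline for drug-ADE relation classification (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

between_texts = [
    "was started and the patient developed",
    "discontinued due to",
    "was given with no change in",
    "dose held after onset of",
] * 25
labels = [1, 1, 0, 1] * 25                   # 1 = ADE-Drug relation, 0 = no relation

rc = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=1.0))
print("CV F1:", cross_val_score(rc, between_texts, labels, cv=5, scoring="f1").mean())
```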

