PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization

2018 ◽  
Vol 35 (16) ◽  
pp. 2749-2756 ◽  
Author(s):  
Jialin Yu ◽  
Shaoping Shi ◽  
Fang Zhang ◽  
Guodong Chen ◽  
Man Cao

Abstract. Motivation: Protein glycation is a common post-translational modification (PTM) that proceeds through a two-step non-enzymatic reaction. Glycation both impairs the function of proteins and changes their characteristics, and it is therefore associated with many human diseases. Systematically detecting glycation sites remains difficult because glycated residues lack distinctive patterns. Computational approaches, which can filter putative sites prior to experimental verification, can greatly increase the efficiency of experimental work. However, the previous lysine glycation prediction method was trained on a small amount of data, so the model does not generalize well. Results: By searching a new database, we collected a large dataset for Homo sapiens. PredGly, a novel tool developed by combining multiple features, predicts lysine glycation sites for H. sapiens. In addition, XGBoost was adopted to optimize the feature vectors and to improve model performance. In a comparison of various classifiers, the support vector machine achieved the best performance. On a new independent test set, PredGly outperformed other glycation tools, which suggests that PredGly can provide more instructive guidance for further experimental research on lysine glycation. Availability and implementation: https://github.com/yujialinncu/PredGly. Supplementary information: Supplementary data are available at Bioinformatics online.

Molecules ◽  
2017 ◽  
Vol 22 (11) ◽  
pp. 1891 ◽  
Author(s):  
Xiaowei Zhao ◽  
Xiaosa Zhao ◽  
Lingling Bao ◽  
Yonggang Zhang ◽  
Jiangyan Dai ◽  
...  

Processes ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 518
Author(s):  
Kexin Bi ◽  
Tong Qiu ◽  
Yizhen Huang

During the development of innovative products, consumer preferences are essential factors for yogurt producers seeking to improve their market share. A high-performance prediction method helps reveal the intrinsic relationship between preferences and sensory attributes. In this study, a novel deep learning method is proposed that uses an autoencoder to extract product features from the sensory attributes scored by experts; the extracted sensory features are then regressed on consumer preferences with support vector machine analysis. Model performance analysis, hedonic contour mapping, and feature clustering were implemented to validate the overall learning process. The results showed that the deep learning model achieves an acceptable level of accuracy, and that the hedonic mapping can be of great help to producers in product design or modification. Finally, hierarchical clustering analysis revealed that, for all three brands of yogurt, low-temperature (4 °C) storage for no more than 4 weeks yields the highest consumer preference.


2021 ◽  
Vol 186 (Supplement_1) ◽  
pp. 445-451
Author(s):  
Yifei Sun ◽  
Navid Rashedi ◽  
Vikrant Vaze ◽  
Parikshit Shah ◽  
Ryan Halter ◽  
...  

Abstract. Introduction: Early prediction of the acute hypotensive episode (AHE) in critically ill patients has the potential to improve outcomes. In this study, we apply different machine learning algorithms to the MIMIC III Physionet dataset, which contains more than 60,000 real-world intensive care unit records, to test commonly used machine learning technologies and compare their performance. Materials and Methods: Five classification methods, namely K-nearest neighbor, logistic regression, support vector machine, random forest, and a deep learning method called long short-term memory, are applied to predict an AHE 30 minutes in advance. An analysis comparing model performance when including versus excluding invasive features was conducted. To further study the pattern of the underlying mean arterial pressure (MAP), we apply linear regression to predict the continuous MAP values over the next 60 minutes. Results: Support vector machine yields the best performance in terms of recall (84%). Including the invasive features in the classification improves the performance significantly, with both recall and precision increasing by more than 20 percentage points. We were able to predict the MAP 60 minutes in the future with a root mean square error (a frequently used measure of the differences between the predicted values and the observed values) of 10 mmHg. After converting continuous MAP predictions into AHE binary predictions, we achieve a 91% recall and 68% precision. In addition to predicting AHE, the MAP predictions provide clinically useful information regarding the timing and severity of the AHE occurrence. Conclusion: We were able to predict AHE with precision and recall above 80% 30 minutes in advance on the large real-world dataset. The predictions of the regression model can provide a more fine-grained, interpretable signal to practitioners. Model performance is improved by the inclusion of invasive features in predicting AHE, when compared to predicting the AHE based on only the available, restricted set of noninvasive technologies. This demonstrates the importance of exploring more noninvasive technologies for AHE prediction.
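The evaluation steps this abstract mentions (root mean square error on continuous MAP predictions, then conversion to binary AHE predictions scored by precision and recall) can be illustrated with a minimal Python sketch. The 65 mmHg cutoff, the function names, and the sample values are assumptions made for illustration only; the abstract does not state the AHE definition the authors used, and clinical definitions usually also require the low pressure to persist over a time window.

```python
import math

# Assumed cutoff for illustration; the abstract does not give the AHE definition.
AHE_MAP_THRESHOLD_MMHG = 65.0


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def to_ahe_labels(map_values, threshold=AHE_MAP_THRESHOLD_MMHG):
    """Simplified conversion: flag an AHE when a MAP value falls below the threshold."""
    return [value < threshold for value in map_values]


def precision_recall(predicted_labels, true_labels):
    """Precision and recall for boolean labels (True means AHE)."""
    pairs = list(zip(predicted_labels, true_labels))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up MAP values in mmHg, not data from the study.
predicted_map = [72.0, 63.0, 58.0, 66.0, 61.0, 80.0]
observed_map = [70.0, 67.0, 60.0, 62.0, 59.0, 78.0]

error = rmse(predicted_map, observed_map)
precision, recall = precision_recall(
    to_ahe_labels(predicted_map), to_ahe_labels(observed_map)
)
print(f"RMSE = {error:.2f} mmHg, precision = {precision:.2f}, recall = {recall:.2f}")
```

Thresholding a regression output in this way keeps the continuous MAP trajectory available alongside the binary alert, which is the extra timing and severity information the abstract refers to.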


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract. Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task's alphabet size. Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. Supplementary information: Supplementary data are available at Bioinformatics online.
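The idea of decomposing a gapped k-mer kernel into independent counting operations over mismatch positions, and then sampling those positions, can be illustrated with a small Python sketch. This is not the FastSK implementation or its API: the function names, the unweighted inner-product form of the kernel, and the uniform sampling scheme are all assumptions chosen to keep the example short. Here `g` is the length of each window and `k` the number of positions that must match, so `g - k` positions are allowed to mismatch.

```python
import math
import random
from collections import Counter
from itertools import combinations


def masked_count(x, y, g, positions):
    """Count pairs of length-g windows (one from x, one from y) that agree
    on the kept positions; all other positions are treated as gaps."""
    def window_counts(seq):
        return Counter(
            tuple(seq[start + p] for p in positions)
            for start in range(len(seq) - g + 1)
        )
    counts_x, counts_y = window_counts(x), window_counts(y)
    return sum(n * counts_y[key] for key, n in counts_x.items())


def exact_gapped_kernel(x, y, g, k):
    """Sum the independent counts over every choice of k kept positions."""
    return sum(
        masked_count(x, y, g, positions)
        for positions in combinations(range(g), k)
    )


def sampled_gapped_kernel(x, y, g, k, n_samples, seed=0):
    """Monte Carlo estimate: average the count over randomly sampled
    position subsets, then scale by the number of possible subsets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        positions = tuple(sorted(rng.sample(range(g), k)))
        total += masked_count(x, y, g, positions)
    return math.comb(g, k) * total / n_samples


x, y = "ACGTACGGTA", "ACGAACGTTA"
print(exact_gapped_kernel(x, y, g=5, k=3))
print(sampled_gapped_kernel(x, y, g=5, k=3, n_samples=200))
```

Because each position subset is counted independently, the exact sum grows with the number of subsets while the sampled version costs a fixed number of counting passes, which is the trade-off a Monte Carlo approximation of this kind is meant to exploit.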


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lei Li ◽  
Desheng Wu

Purpose: The infraction of securities regulations (ISRs) by listed firms in their day-to-day operations and management has become a common problem. This paper proposes several machine learning approaches to forecast the risk of infractions by listed companies, in order to address financial supervision that is neither effective nor precise. Design/methodology/approach: The proposed research framework for forecasting ISRs includes data collection and cleaning, feature engineering, data splitting, application of prediction approaches and model performance evaluation. We select Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machines, Artificial Neural Network and Long Short-Term Memory Networks (LSTMs) as ISR prediction models. Findings: The results show that models that include prior infractions predict ISRs significantly better than those without them, especially for large sample sets. The results also indicate that, when judging whether a company has committed infractions, attention should be paid to novel artificial intelligence methods, the company's previous infractions and large data sets. Originality/value: The findings could be used, to a certain degree, to address the problem of identifying listed companies' ISRs. Overall, the results elucidate the value of prior infractions of securities regulations. This shows the importance of including more data sources when constructing distress models, rather than focusing only on building increasingly complex models on the same data. This is also beneficial to the regulatory authorities.


2018 ◽  
Vol 35 (16) ◽  
pp. 2757-2765 ◽  
Author(s):  
Balachandran Manavalan ◽  
Shaherin Basith ◽  
Tae Hwan Shin ◽  
Leyi Wei ◽  
Gwang Lee

Abstract. Motivation: Cardiovascular disease is the primary cause of death globally, accounting for approximately 17.7 million deaths per year. One of the risk factors linked with cardiovascular diseases and other complications is hypertension. Naturally derived bioactive peptides with antihypertensive activities serve as promising alternatives to pharmaceutical drugs. So far, there has been no comprehensive analysis or assessment of diverse features, and no implementation of various machine-learning (ML) algorithms, for antihypertensive peptide (AHTP) model construction. Results: In this study, we utilized six different ML algorithms, namely AdaBoost, extremely randomized tree (ERT), gradient boosting (GB), k-nearest neighbor, random forest (RF) and support vector machine (SVM), with 51 feature descriptors derived from eight different feature encodings for the prediction of AHTPs. Because ERT-based models performed consistently better than other algorithms regardless of the feature descriptor, we treated them as baseline predictors, whose predicted probabilities of AHTPs were further used as input features separately for four different ML algorithms (ERT, GB, RF and SVM), and developed the corresponding meta-predictors using a two-step feature selection protocol. Subsequently, the integration of the four meta-predictors through an ensemble learning approach improved the balanced prediction performance and model robustness on the independent dataset. Upon comparison with existing methods, mAHTPred showed superior performance, with an overall improvement of approximately 6–7% in both benchmarking and independent datasets. Availability and implementation: The user-friendly online prediction tool, mAHTPred, is freely accessible at http://thegleelab.org/mAHTPred. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 85 (6) ◽  
pp. 434-442 ◽  
Author(s):  
Noushin Mokhtari ◽  
Clemens Gühmann

Abstract. For diagnosis and predictive maintenance of mechatronic systems, monitoring of bearings is essential. An important building block for this is the determination of the bearing friction condition. This paper deals with the possibility of monitoring different journal bearing friction states, such as mixed and fluid friction, and examines a new approach to distinguishing between different friction intensities under several speed and load combinations, based on feature extraction and feature selection methods applied to acoustic emission (AE) signals. The aim of this work is to identify features of AE signals that separate the friction states effectively, in order to subsequently classify the journal bearing friction states. Furthermore, the acquired features give information about the mixed friction intensity, which is significant for remaining useful lifetime (RUL) prediction. Time-domain features as well as frequency-domain features have been investigated in this work. To increase the sensitivity of the extracted features, the AE signals were transformed to the time-frequency domain using the continuous wavelet transform (CWT). Significant frequency bands are determined to separate different friction states more effectively. A support vector machine (SVM) is used to classify the signals into three different friction classes. Finally, an idea for an RUL prediction method that uses the information already determined is presented and explained.

