XGBPRH: Prediction of Binding Hot Spots at Protein–RNA Interfaces Utilizing Extreme Gradient Boosting

Hot spot residues at protein–RNA complexes are vitally important for investigating the underlying molecular recognition mechanism. Accurately identifying protein–RNA binding hot spots is critical for drug designing and protein engineering. Although some progress has been made by utilizing various available features and a series of machine learning approaches, these methods are still in the infant stage. In this paper, we present a new computational method named XGBPRH, which is based on an eXtreme Gradient Boosting (XGBoost) algorithm and can effectively predict hot spot residues in protein–RNA interfaces utilizing an optimal set of properties. Firstly, we download 47 protein–RNA complexes and calculate a total of 156 sequence, structure, exposure, and network features. Next, we adopt a two-step feature selection algorithm to extract a combination of 6 optimal features from the combination of these 156 features. Compared with the state-of-the-art approaches, XGBPRH achieves better performances with an area under the ROC curve (AUC) score of 0.817 and an F1-score of 0.802 on the independent test set. Meanwhile, we also apply XGBPRH to two case studies. The results demonstrate that the method can effectively identify novel energy hotspots.

Download Full-text

Prediction of hot spots in protein–DNA binding interfaces based on supervised isometric feature mapping and extreme gradient boosting

BMC Bioinformatics ◽

10.1186/s12859-020-03683-3 ◽

2020 ◽

Vol 21 (S13) ◽

Cited By ~ 2

Author(s):

Ke Li ◽

Sijia Zhang ◽

Di Yan ◽

Yannan Bin ◽

Junfeng Xia

Keyword(s):

Feature Selection ◽

Manifold Learning ◽

Hot Spots ◽

Large Scale ◽

Computational Method ◽

Gradient Boosting ◽

Feature Mapping ◽

Accessible Information ◽

Extreme Gradient Boosting ◽

Isometric Feature Mapping

Abstract Background Identification of hot spots in protein-DNA interfaces provides crucial information for the research on protein-DNA interaction and drug design. As experimental methods for determining hot spots are time-consuming, labor-intensive and expensive, there is a need for developing reliable computational method to predict hot spots on a large scale. Results Here, we proposed a new method named sxPDH based on supervised isometric feature mapping (S-ISOMAP) and extreme gradient boosting (XGBoost) to predict hot spots in protein-DNA complexes. We obtained 114 features from a combination of the protein sequence, structure, network and solvent accessible information, and systematically assessed various feature selection methods and feature dimensionality reduction methods based on manifold learning. The results show that the S-ISOMAP method is superior to other feature selection or manifold learning methods. XGBoost was then used to develop hot spots prediction model sxPDH based on the three dimensionality-reduced features obtained from S-ISOMAP. Conclusion Our method sxPDH boosts prediction performance using S-ISOMAP and XGBoost. The AUC of the model is 0.773, and the F1 score is 0.713. Experimental results on benchmark dataset indicate that sxPDH can achieve generally better performance in predicting hot spots compared to the state-of-the-art methods.

Download Full-text

XPredRBR: Accurate and Fast Prediction of RNA-Binding Residues in Proteins Using eXtreme Gradient Boosting

Bioinformatics Research and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-319-94968-0_14 ◽

2018 ◽

pp. 163-173

Author(s):

Lei Deng ◽

Zuojin Dong ◽

Hui Liu

Keyword(s):

Rna Binding ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

Binding Residues ◽

Fast Prediction

Download Full-text

Classification of Hot Spots using XGBoost and LightGBM Algorithms

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9459.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 722-724

Keyword(s):

Computational Methods ◽

Protein Interactions ◽

Hot Spots ◽

Cell Metabolism ◽

Pearson Correlation ◽

Classification Performance ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Hub Proteins

Protein-Protein Interactions referred as PPIs perform significant role in biological functions like cell metabolism, immune response, signal transduction etc. Hot spots are small fractions of residues in interfaces and provide substantial binding energy in PPIs. Therefore, identification of hot spots is important to discover and analyze molecular medicines and diseases. The current strategy, alanine scanning isn't pertinent to enormous scope applications since the technique is very costly and tedious. The existing computational methods are poor in classification performance as well as accuracy in prediction. They are concerned with the topological structure and gene expression of hub proteins. The proposed system focuses on hot spots of hub proteins by eliminating redundant as well as highly correlated features using Pearson Correlation Coefficient and Support Vector Machine based feature elimination. Extreme Gradient boosting and LightGBM algorithms are used to ensemble a set of weak classifiers to form a strong classifier. The proposed system shows better accuracy than the existing computational methods. The model can also be used to predict accurate molecular inhibitors for specific PPIs

Download Full-text

Understanding Multi-Vehicle Collision Patterns on Freeways—A Machine Learning Approach

Infrastructures ◽

10.3390/infrastructures5080062 ◽

2020 ◽

Vol 5 (8) ◽

pp. 62

Author(s):

Clint Morris ◽

Jidong J. Yang

Keyword(s):

Machine Learning ◽

Statistical Methods ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Learning Approaches ◽

Crash Analysis ◽

Suitable Alternative ◽

Crash Data ◽

Extreme Gradient Boosting ◽

Modern Machine

Generating meaningful inferences from crash data is vital to improving highway safety. Classic statistical methods are fundamental to crash data analysis and often regarded for their interpretability. However, given the complexity of crash mechanisms and associated heterogeneity, classic statistical methods, which lack versatility, might not be sufficient for granular crash analysis because of the high dimensional features involved in crash-related data. In contrast, machine learning approaches, which are more flexible in structure and capable of harnessing richer data sources available today, emerges as a suitable alternative. With the aid of new methods for model interpretation, the complex machine learning models, previously considered enigmatic, can be properly interpreted. In this study, two modern machine learning techniques, Linear Discriminate Analysis and eXtreme Gradient Boosting, were explored to classify three major types of multi-vehicle crashes (i.e., rear-end, same-direction sideswipe, and angle) occurred on Interstate 285 in Georgia. The study demonstrated the utility and versatility of modern machine learning methods in the context of crash analysis, particularly in understanding the potential features underlying different crash patterns on freeways.

Download Full-text

PredCID: prediction of driver frameshift indels in human cancer

Briefings in Bioinformatics ◽

10.1093/bib/bbaa119 ◽

2020 ◽

Cited By ~ 7

Author(s):

Zhenyu Yue ◽

Xinlu Chu ◽

Junfeng Xia

Keyword(s):

Cancer Biology ◽

Missing Values ◽

Human Cancer ◽

Computational Method ◽

Genomic Research ◽

Validation Dataset ◽

Gradient Boosting ◽

Cancer Driver ◽

Extreme Gradient Boosting ◽

Frameshift Indels

Abstract The discrimination of driver from passenger mutations has been a hot topic in the field of cancer biology. Although recent advances have improved the identification of driver mutations in cancer genomic research, there is no computational method specific for the cancer frameshift indels (insertions or/and deletions) yet. In addition, existing pathogenic frameshift indel predictors may suffer from plenty of missing values because of different choices of transcripts during the variant annotation processes. In this study, we proposed a computational model, called PredCID (Predictor for Cancer driver frameshift InDels), for accurately predicting cancer driver frameshift indels. Gene, DNA, transcript and protein level features are combined together and selected for classification with eXtreme Gradient Boosting classifier. Benchmarking results on the cross-validation dataset and independent dataset showed that PredCID achieves better and robust performance compared with existing noncancer-specific methods in distinguishing cancer driver frameshift indels from passengers and is therefore a valuable method for deeper understanding of frameshift indels in human cancer. PredCID is freely available for academic research at http://bioinfo.ahu.edu.cn:8080/PredCID.

Download Full-text

Gradient Boosting Machine Learning to Improve Satellite-Derived Column Water Vapor Measurement Error

10.5194/amt-2019-308 ◽

2019 ◽

Cited By ~ 1

Author(s):

Allan C. Just ◽

Yang Liu ◽

Meytar Sorek-Hamer ◽

Johnathan Rush ◽

Michael Dorman ◽

...

Keyword(s):

Machine Learning ◽

Water Vapor ◽

Measurement Error ◽

Earth Science ◽

Atmospheric Correction ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Learning Approaches ◽

Sensing Applications ◽

Extreme Gradient Boosting

Abstract. The atmospheric products of the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm include column water vapor (CWV) at 1 km resolution, derived from daily overpasses of NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra satellites. We have recently shown that machine learning using extreme gradient boosting (XGBoost) can improve the estimation of MAIAC aerosol optical depth (AOD). Although MAIAC CWV is generally well validated (Pearson’s R > 0.97 versus CWV from AERONET sun photometers), it has not yet been assessed whether machine-learning approaches can further improve CWV. Using a novel spatiotemporal cross-validation approach to avoid overfitting, our XGBoost model with nine features derived from land use terms, date, and ancillary variables from the MAIAC retrieval, quantifies and can correct a substantial portion of measurement error relative to collocated measures at AERONET sites (26.9 % and 16.5 % decrease in Root Mean Square Error (RMSE) for Terra and Aqua datasets, respectively) in the Northeastern USA, 2000–2015. We use machine-learning interpretation tools to illustrate complex patterns of measurement error and describe a positive bias in MAIAC Terra CWV worsening in recent summertime conditions. We validate our predictive model on MAIAC CWV estimates at independent stations from the SuomiNet GPS network where our corrections decrease the RMSE by 19.7 % and 9.5 % for Terra and Aqua MAIAC CWV. Empirically correcting for measurement error with machine-learning algorithms is a post-processing opportunity to improve satellite-derived CWV data for Earth science and remote sensing applications.

Download Full-text

Gradient boosting machine learning to improve satellite-derived column water vapor measurement error

Atmospheric Measurement Techniques ◽

10.5194/amt-13-4669-2020 ◽

2020 ◽

Vol 13 (9) ◽

pp. 4669-4681

Author(s):

Allan C. Just ◽

Yang Liu ◽

Meytar Sorek-Hamer ◽

Johnathan Rush ◽

Michael Dorman ◽

...

Keyword(s):

Machine Learning ◽

Water Vapor ◽

Measurement Error ◽

Earth Science ◽

Atmospheric Correction ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Learning Approaches ◽

Sensing Applications ◽

Extreme Gradient Boosting

Abstract. The atmospheric products of the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm include column water vapor (CWV) at a 1 km resolution, derived from daily overpasses of NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra satellites. We have recently shown that machine learning using extreme gradient boosting (XGBoost) can improve the estimation of MAIAC aerosol optical depth (AOD). Although MAIAC CWV is generally well validated (Pearson's R > 0.97 versus CWV from AERONET sun photometers), it has not yet been assessed whether machine-learning approaches can further improve CWV. Using a novel spatiotemporal cross-validation approach to avoid overfitting, our XGBoost model, with nine features derived from land use terms, date, and ancillary variables from the MAIAC retrieval, quantifies and can correct a substantial portion of measurement error relative to collocated measurements at AERONET sites (26.9 % and 16.5 % decrease in root mean square error (RMSE) for Terra and Aqua datasets, respectively) in the Northeastern USA, 2000–2015. We use machine-learning interpretation tools to illustrate complex patterns of measurement error and describe a positive bias in MAIAC Terra CWV worsening in recent summertime conditions. We validate our predictive model on MAIAC CWV estimates at independent stations from the SuomiNet GPS network where our corrections decrease the RMSE by 19.7 % and 9.5 % for Terra and Aqua MAIAC CWV. Empirically correcting for measurement error with machine-learning algorithms is a postprocessing opportunity to improve satellite-derived CWV data for Earth science and remote sensing applications.

Download Full-text

Applying Deep Neural Networks and Ensemble Machine Learning Methods to Forecast Airborne Ambrosia Pollen

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph16111992 ◽

2019 ◽

Vol 16 (11) ◽

pp. 1992 ◽

Cited By ~ 6

Author(s):

Gebreab K. Zewdie ◽

David J. Lary ◽

Estelle Levetin ◽

Gemechu F. Garuma

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Land Surface ◽

Deep Neural Networks ◽

Airborne Pollen ◽

Training Data ◽

Gradient Boosting ◽

Learning Approaches ◽

Ambrosia Pollen ◽

Extreme Gradient Boosting

Allergies to airborne pollen are a significant issue affecting millions of Americans. Consequently, accurately predicting the daily concentration of airborne pollen is of significant public benefit in providing timely alerts. This study presents a method for the robust estimation of the concentration of airborne Ambrosia pollen using a suite of machine learning approaches including deep learning and ensemble learners. Each of these machine learning approaches utilize data from the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric weather and land surface reanalysis. The machine learning approaches used for developing a suite of empirical models are deep neural networks, extreme gradient boosting, random forests and Bayesian ridge regression methods for developing our predictive model. The training data included twenty-four years of daily pollen concentration measurements together with ECMWF weather and land surface reanalysis data from 1987 to 2011 is used to develop the machine learning predictive models. The last six years of the dataset from 2012 to 2017 is used to independently test the performance of the machine learning models. The correlation coefficients between the estimated and actual pollen abundance for the independent validation datasets for the deep neural networks, random forest, extreme gradient boosting and Bayesian ridge were 0.82, 0.81, 0.81 and 0.75 respectively, showing that machine learning can be used to effectively forecast the concentrations of airborne pollen.

Download Full-text

Pharmacy Impact on Covid-19 Vaccination Progress Using Machine Learning Approach

Journal of Pharmaceutical Research International ◽

10.9734/jpri/2021/v33i38a32076 ◽

2021 ◽

pp. 202-217

Author(s):

Shawni Dutta ◽

Upasana Mukherjee ◽

Samir Kumar Bandyopadhyay

Keyword(s):

Machine Learning ◽

Social Life ◽

Mean Squared Error ◽

Severe Depression ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Learning Approaches ◽

Human Beings ◽

Infected People ◽

Extreme Gradient Boosting

The novel coronavirus disease (COVID-19) has created immense threats to public health on various levels around the globe. The unpredictable outbreak of this disease and the pandemic situation are causing severe depression, anxiety and other mental as physical health related problems among the human beings. This deadly disease has put social, economic condition of the entire world into an enormous challenge. To combat against this disease, vaccination is essential as it will boost the immune system of human beings while being in the contact with the infected people. The vaccination process is thus necessary to confront the outbreak of COVID-19. The worldwide vaccination progress should be tracked to identify how fast the entire economic as well as social life will be stabilized. The monitor of the vaccination progress, a machine learning based Regressor model is approached in this study. This vaccination tracking process has been applied on the data starting from 14th December, 2020 to 24th April, 2021. A couple of ensemble based machine learning Regressor models such as Random Forest, Extra Trees, Gradient Boosting, AdaBoost and Extreme Gradient Boosting are implemented and their predictive performance are compared. The comparative study reveals that the Extra trees Regressor outperforms with minimized mean absolute error (MAE) of 6.465 and root mean squared error (RMSE) of 8.127. The uniqueness of this study relies on assessing as well as predicting vaccination intake progress by utilizing automated process offered by machine learning techniques. The innovative idea of the method is that the vaccination process and their priority are considered in the paper. Among several existing machine learning approaches, the ensemble based learning paradigms are employed in this study so that improved prediction efficiency can be delivered.

Download Full-text

Stock Portfolio Prediction by Multi-Target Decision Support

iSys - Brazilian Journal of Information Systems ◽

10.5753/isys.2019.381 ◽

2019 ◽

Vol 12 (1) ◽

pp. 05-27

Author(s):

Everton Jose Santana ◽

João Augusto Provin Ribeiro Da silva ◽

Saulo Martiello Mastelini ◽

Sylvio Barbon Jr

Keyword(s):

Decision Support ◽

Random Forest ◽

Stock Market ◽

Prediction Models ◽

Deep Structure ◽

Predictive Performance ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Extreme Gradient Boosting

Investing in the stock market is a complex process due to its high volatility caused by factors as exchange rates, political events, inflation and the market history. To support investor's decisions, the prediction of future stock price and economic metrics is valuable. With the hypothesis that there is a relation among investment performance indicators, the goal of this paper was exploring multi-target regression (MTR) methods to estimate 6 different indicators and finding out the method that would best suit in an automated prediction tool for decision support regarding predictive performance. The experiments were based on 4 datasets, corresponding to 4 different time periods, composed of 63 combinations of weights of stock-picking concepts each, simulated in the US stock market. We compared traditional machine learning approaches with seven state-of-the-art MTR solutions: Stacked Single Target, Ensemble of Regressor Chains, Deep Structure for Tracking Asynchronous Regressor Stacking, Deep Regressor Stacking, Multi-output Tree Chaining, Multi-target Augment Stacking and Multi-output Random Forest (MORF). With the exception of MORF, traditional approaches and the MTR methods were evaluated with Extreme Gradient Boosting, Random Forest and Support Vector Machine regressors. By means of extensive experimental evaluation, our results showed that the most recent MTR solutions can achieve suitable predictive performance, improving all the scenarios (14.70% in the best one, considering all target variables and periods). In this sense, MTR is a proper strategy for building stock market decision support system based on prediction models.

Download Full-text