Predicting escalations in customer support with gradient boosting at the IEEE BigData 2020 Cup

Author(s):  
Peter Klimov ◽  
Vladimir Funtikov
2020 ◽  
Vol 39 (5) ◽  
pp. 6579-6590
Author(s):  
Sandy Çağlıyor ◽  
Başar Öztayşi ◽  
Selime Sezgin

The motion picture industry is one of the largest industries worldwide and plays a significant role in the global economy. Given the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performance, and the industry still struggles to predict box office performance in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. A dataset of 1559 movies is constructed from various sources. First, the independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting tree, and random forest) are employed, and the impact of each variable group is observed by comparing the performance of the resulting models. Then the number of target classes is increased to five and eight, and the results are compared with previously developed models in the literature.
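The abstract above does not say how attendance is discretized into three classes. As a minimal sketch, assuming equal-frequency (tertile) binning, the step could look like the following; the attendance figures and the function names `tertile_cutoffs` and `discretize` are hypothetical and not taken from the study.

```python
def tertile_cutoffs(values):
    """Return two cut points that split the sorted values into three
    roughly equal-sized groups (assumed binning scheme, not the paper's)."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def discretize(value, low_cut, high_cut):
    """Map a raw attendance figure to class 0 (low), 1 (medium) or 2 (high)."""
    if value < low_cut:
        return 0
    if value < high_cut:
        return 1
    return 2


# Hypothetical attendance numbers for nine movies.
attendance = [1200, 54000, 310000, 8700, 95000, 2200000, 43000, 760000, 15000]

low_cut, high_cut = tertile_cutoffs(attendance)
labels = [discretize(a, low_cut, high_cut) for a in attendance]
print(low_cut, high_cut, labels)
```

Increasing the number of target classes to five or eight, as the abstract describes, would amount to using more cut points under the same assumption.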


2020 ◽  
Vol 21 (2) ◽  
pp. 206-214
Author(s):  
V. S. Tynchenko ◽  
I. A. Golovenok ◽  
V. E. Petrenko ◽  
A. V. Milov ◽  
...  

2019 ◽  
Vol 21 (9) ◽  
pp. 662-669
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict the Ki values of thrombin inhibitors from a large data set using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, namely multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model was the best of these methods, with R² = 0.84 and MSE = 0.55 for the training set and R² = 0.83 and MSE = 0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
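The y-randomization test mentioned above checks that a model's fit is not due to chance: the response values are shuffled, the model is refitted, and the resulting scores should collapse. The sketch below illustrates the idea on synthetic data with a one-descriptor least-squares line standing in for the study's SVM model; the data, the function names, and the choice of 200 shuffles are assumptions made only for illustration.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(x, y):
    """In-sample coefficient of determination of the fitted line."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def y_randomization(x, y, n_rounds=200, seed=0):
    """Return the R^2 on the real data and the R^2 values obtained
    after refitting on randomly shuffled responses."""
    rng = random.Random(seed)
    shuffled_scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)
        shuffled_scores.append(r_squared(x, shuffled))
    return r_squared(x, y), shuffled_scores


# Synthetic descriptor/response pairs with a genuine linear relationship.
noise = random.Random(42)
x = [i / 10 for i in range(50)]
y = [2.0 * xi + 1.0 + noise.gauss(0, 0.5) for xi in x]

true_r2, shuffled_r2 = y_randomization(x, y)
mean_shuffled = sum(shuffled_r2) / len(shuffled_r2)
print(f"R^2 on real y: {true_r2:.3f}; mean R^2 on shuffled y: {mean_shuffled:.3f}")
```

A model that passes the test shows a real-data score well above every shuffled score, as happens here by construction.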


Author(s):  
Shu Cheng ◽  
Yanrui Ding

Background: Quantitative Structure Activity Relationship (QSAR) methods based on machine learning play a vital role in predicting biological effects. Objective: Considering the characteristics of the binding interface between ligands and the inhibitory neurotransmitter Gamma Aminobutyric Acid A (GABAA) receptor, we built a QSAR model of ligands that bind to the human GABAA receptor. Method: After feature selection with Mean Decrease Impurity, we selected 53 of the 1,286 docked ligand molecular descriptors. Three QSAR models were built using the gradient boosting regression tree algorithm, based on different combinations of docked ligand molecular descriptors and ligand-receptor interaction characteristics. Results: The features of the optimal QSAR model contain both the docked ligand molecular descriptors and the ligand-receptor interaction characteristics. The leave-one-out cross-validation coefficient (Q² LOO) of the optimal QSAR model is 0.8974, the coefficient of determination (R²) for the testing set is 0.9261, and the mean square error (MSE) is 0.1862. We also used this model to predict the pIC50 of two new ligands; the differences between the predicted and experimental pIC50 are -0.02 and 0.03, respectively. Conclusion: We found that BELm2, BELe2, MATS1m, X5v, Mor08v, and Mor29m are crucial features, which can help to build the QSAR model more accurately.
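The Q² LOO statistic reported above is commonly computed as 1 - PRESS / SS_tot, where PRESS sums the squared errors of predictions made for each sample by a model trained on all the other samples. The sketch below assumes that common definition (the abstract does not spell it out) and uses a one-descriptor least-squares line in place of the gradient boosting model; the data values and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def q2_loo(x, y):
    """Leave-one-out Q^2 = 1 - PRESS / SS_tot (assumed definition)."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        # Refit without sample i, then predict the held-out sample.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss_tot


# Hypothetical descriptor values and noisy activities (roughly 1 + 0.5 * x).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y_noisy = [1.1, 1.4, 2.2, 2.4, 3.1, 3.4, 4.2, 4.3, 5.1, 5.6]
y_exact = [3.0 * xi + 1.0 for xi in x]

q2_noisy = q2_loo(x, y_noisy)
q2_exact = q2_loo(x, y_exact)
print(f"Q2 LOO (noisy): {q2_noisy:.4f}, Q2 LOO (exact line): {q2_exact:.4f}")
```

Because each prediction is made for a sample the model has not seen, Q² LOO is never higher than the in-sample R² of the same model, which is why it is reported as a separate robustness check.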


2020 ◽  
Vol 15 ◽  
Author(s):  
Affan Alim ◽  
Abdul Rafay ◽  
Imran Naseem

Background: Proteins contribute significantly to every task of cellular life. Their functions encompass the building and repairing of tissues in human bodies and other organisms; hence they are the building blocks of bones, muscles, cartilage, skin, and blood. Similarly, antifreeze proteins (AFPs) are of prime significance for organisms that live in very cold areas. With the help of these proteins, cold-water organisms can survive at sub-zero temperatures and resist the water crystallization process, which may rupture internal cells and tissues. AFPs have attracted attention and interest in the food industry and in cryopreservation. Objective: With the increasing availability of genomic sequence data for proteins, an automated and sophisticated tool for AFP recognition and identification is sorely needed. The sequences and structures of AFPs are highly distinct; therefore, most of the proposed methods fail to show promising results on different structures. A consolidated method is proposed to achieve competitive performance on highly distinct AFP structures. Methods: In this study, we propose to use Principal Component Analysis (PCA) followed by Gradient Boosting (GB) for antifreeze protein identification. To analyze the performance and validity of the proposed model, various combinations of two-segment compositions of amino acids and dipeptides are used. PCA is used to reduce dimensionality while retaining most of the variance in the data, and is followed by an ensemble method, gradient boosting, for modelling and classification. Results: The proposed method achieved superior performance on the PDB, Pfam and UniProt datasets compared with the RAFP-Pred method. In experiment 3, using only 150 PCA components, a high accuracy of 89.63 was achieved, which is superior to the 87.41 reported for the RAFP-Pred method using 300 significant features. Experiment 2 was conducted using two different datasets: non-AFPs from the PISCES server and AFPs from the Protein Data Bank. In experiment 2, our proposed method attained a high sensitivity of 79.16, which is 12.50 higher than that of the state-of-the-art RAFP-Pred method. Conclusion: AFPs have a common function but distinct structures. Therefore, a single model developed for different sequences often fails to identify AFPs. Our proposed model has shown robust results across diverse training and testing datasets, and it outperformed the previous AFP prediction method, RAFP-Pred. Our model consists of PCA for dimension reduction followed by gradient boosting for classification. Owing to its simplicity, scalability and high performance, our model can be easily extended to the analysis of proteomic and genomic datasets.


2021 ◽  
Vol 15 ◽  
Author(s):  
Lichao Zhang ◽  
Zihong Huang ◽  
Liang Kong

Background: RNA-binding proteins establish posttranscriptional gene regulation by coordinating the maturation, editing, transport, stability, and translation of cellular RNAs. Immunoprecipitation experiments can identify interactions between RNA and proteins, but they are limited by the experimental environment and materials. Therefore, it is essential to construct computational models to identify the functional sites. Objective: Although some computational methods have been proposed to predict RNA binding sites, their accuracy could be further improved. Moreover, it is necessary to construct a dataset with more samples to design a reliable model. Here we present a computational model based on multiple information sources to identify RNA binding sites. Method: We construct an accurate computational model named CSBPI_Site, based on extreme gradient boosting. The specifically designed 15-dimensional feature vector captures four types of information (chemical shift, chemical bond, chemical properties and position information). Results: A satisfactory accuracy of 0.86 and an AUC of 0.89 were obtained by leave-one-out cross validation. Meanwhile, the accuracies differed only slightly (ranging from 0.83 to 0.85) among three classification algorithms, which showed that the novel features are stable and suit multiple classifiers. These results showed that the proposed method is effective and robust for identifying noncoding RNA binding sites. Conclusion: Our method based on multiple information sources effectively represents the binding site information among ncRNAs. The satisfactory prediction results for the Diels-Alder ribozyme based on CSBPI_Site indicate that our model is valuable for identifying functional sites.

