Hybrid Basketball Game Outcome Prediction Model by Integrating Data Mining Methods for the National Basketball Association

Entropy ◽  
2021 ◽  
Vol 23 (4) ◽  
pp. 477
Author(s):  
Wei-Jen Chen ◽  
Mao-Jhen Jhou ◽  
Tian-Shyug Lee ◽  
Chi-Jie Lu

The sports market has grown rapidly over the last several decades. Sports outcome prediction is an attractive sports analytics challenge because it provides useful information for operations in the sports market. In this study, a hybrid basketball game outcome prediction scheme is developed for predicting the final score of National Basketball Association (NBA) games by integrating five data mining techniques: extreme learning machine, multivariate adaptive regression splines, k-nearest neighbors, eXtreme gradient boosting (XGBoost), and stochastic gradient boosting. Designed features are generated by merging information from different game-lags of fundamental basketball statistics and are used in the proposed scheme. This study collected data from all the games of the NBA 2018–2019 season. There are 30 teams in the NBA and each team plays 82 games per season, so a total of 2460 NBA game data points were collected. Empirical results illustrated that the proposed hybrid basketball game prediction scheme achieves high prediction performance and identifies suitable game-lag information and relevant game features (statistics). Our findings suggested that a two-stage XGBoost model using four game-lags of information achieves the best prediction performance among all competing models. Six designed features from four game-lags (averaged defensive rebounds, averaged two-point field goal percentage, averaged free throw percentage, averaged offensive rebounds, averaged assists, and averaged three-point field goal attempts) have a greater effect on the prediction of final scores of NBA games than those from other game-lags. The findings of this study provide relevant insights and guidance for outcome prediction research in other team or individual sports.
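The game-lag features described in this abstract can be illustrated with a minimal sketch. It assumes that a "game-lag" of n means averaging a statistic over a team's previous n games; the function name `lagged_averages`, the field names, and the box-score values are hypothetical and not taken from the paper.

```python
from statistics import mean


def lagged_averages(games, lag):
    """Average every statistic over the `lag` games preceding each game.

    `games` is a chronological list of dicts for a single team.
    Games with fewer than `lag` earlier games are skipped.
    """
    features = []
    for i in range(lag, len(games)):
        window = games[i - lag:i]  # the `lag` games played before game i
        row = {f"avg_{key}": mean(g[key] for g in window) for key in window[0]}
        features.append(row)
    return features


# Hypothetical box-score values for one team, oldest game first.
games = [
    {"defensive_rebounds": 30, "assists": 20},
    {"defensive_rebounds": 34, "assists": 24},
    {"defensive_rebounds": 32, "assists": 22},
    {"defensive_rebounds": 36, "assists": 26},
    {"defensive_rebounds": 40, "assists": 30},
]

# With a lag of four, only the fifth game has enough history.
features = lagged_averages(games, lag=4)
```

Only past games enter each window, so the feature for a game never uses that game's own statistics.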


Processes ◽  
2021 ◽  
Vol 9 (9) ◽  
pp. 1563
Author(s):  
Chi-Jie Lu ◽  
Tian-Shyug Lee ◽  
Chien-Chih Wang ◽  
Wei-Jen Chen

Developing an effective sports performance analysis process is an attractive issue in sports team management. This study proposed an improved sports outcome prediction process by integrating adaptive weighted features and machine learning algorithms for basketball game score prediction. The feature engineering method is used to construct designed features based on game-lag information and adaptive weighting of variables in the proposed prediction process. These designed features are then applied to the five machine learning methods, including classification and regression trees (CART), random forest (RF), stochastic gradient boosting (SGB), eXtreme gradient boosting (XGBoost), and extreme learning machine (ELM) for constructing effective prediction models. The empirical results from National Basketball Association (NBA) data revealed that the proposed sports outcome prediction process could generate a promising prediction result compared to the competing models without adaptive weighting features. Our results also showed that the machine learning models with four game-lags information and adaptive weighting of power could generate better prediction performance.



2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Satoko Hiura ◽  
Shige Koseki ◽  
Kento Koyama

In predictive microbiology, statistical models are employed to predict bacterial population behavior in food using environmental factors such as temperature, pH, and water activity. As the amount and complexity of data increase, handling all data with high-dimensional variables becomes a difficult task. We propose a data mining approach to predict bacterial behavior using a database of microbial responses to food environments. Population growth and inactivation data for the pathogen Listeria monocytogenes under 1,007 environmental conditions, including five food categories (beef, culture medium, pork, seafood, and vegetables) and temperatures ranging from 0 to 25 °C, were obtained from the ComBase database (www.combase.cc). We used the eXtreme gradient boosting tree, a machine learning algorithm, to predict bacterial population behavior from eight explanatory variables: ‘time’, ‘temperature’, ‘pH’, ‘water activity’, ‘initial cell counts’, ‘whether the viable count is initial cell number’, and two types of categories regarding food. The root mean square error between the observed and predicted values was approximately 1.0 log CFU regardless of food category, which suggests that viable bacterial counts can be predicted in various foods. The data mining approach examined here will enable the prediction of bacterial population behavior in food by identifying hidden patterns within a large amount of data.
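The root mean square error reported in this abstract can be computed as in the following sketch. The observed and predicted values are hypothetical log CFU counts chosen for illustration, not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical viable counts in log CFU.
observed = [3.0, 5.0, 7.0]
predicted = [4.0, 5.0, 6.0]
error = rmse(observed, predicted)
```

Because the counts are on a log scale, an RMSE of about 1.0 corresponds to predictions that are typically within a factor of ten of the observed count.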



2021 ◽  
pp. 289-301
Author(s):  
B. Martín ◽  
J. González–Arias ◽  
J. A. Vicente–Vírseda

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest, an enhanced form of bootstrap). We also applied extreme gradient boosting, an enhanced form of gradient boosting, to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Proxies of frontal systems and ocean productivity domains that have previously been used to characterize the oceanographic habitats of seabirds were derived from open–source datasets, quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R2). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.



2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yong Chen

An improved nonlinear weighted extreme gradient boosting (XGBoost) technique is developed to forecast length of stay for patients with imbalanced data. The algorithm first chooses an effective technique for fitting the duration of stay and determining the distribution law, and then optimizes the negative log-likelihood loss function using a heuristic nonlinear weighting method based on sample percentage. Theoretical and practical results reveal that, when compared to existing algorithms, the XGBoost method based on nonlinear weighting may achieve higher classification accuracy and better prediction performance, which is beneficial in treating more patients with fewer hospital beds.



2020 ◽  
Vol 12 (23) ◽  
pp. 3925
Author(s):  
Ivan Pilaš ◽  
Mateo Gašparović ◽  
Alan Novkinić ◽  
Damir Klobučar

The presented study demonstrates a bi-sensor approach suitable for rapid and precise up-to-date mapping of forest canopy gaps for the larger spatial extent. The approach makes use of Unmanned Aerial Vehicle (UAV) red, green and blue (RGB) images on smaller areas for highly precise forest canopy mask creation. Sentinel-2 was used as a scaling platform for transferring information from the UAV to a wider spatial extent. Various approaches to an improvement in the predictive performance were examined: (I) the highest R2 of the single satellite index was 0.57, (II) the highest R2 using multiple features obtained from the single-date, S-2 image was 0.624, and (III) the highest R2 on the multitemporal set of S-2 images was 0.697. Satellite indices such as Atmospherically Resistant Vegetation Index (ARVI), Infrared Percentage Vegetation Index (IPVI), Normalized Difference Index (NDI45), Pigment-Specific Simple Ratio Index (PSSRa), Modified Chlorophyll Absorption Ratio Index (MCARI), Color Index (CI), Redness Index (RI), and Normalized Difference Turbidity Index (NDTI) were the dominant predictors in most of the Machine Learning (ML) algorithms. The more complex ML algorithms such as the Support Vector Machines (SVM), Random Forest (RF), Stochastic Gradient Boosting (GBM), Extreme Gradient Boosting (XGBoost), and Catboost that provided the best performance on the training set exhibited weaker generalization capabilities. Therefore, a simpler and more robust Elastic Net (ENET) algorithm was chosen for the final map creation.



2011 ◽  
pp. 1915-1935
Author(s):  
Dan Steinberg ◽  
Mikhaylo Golovnya ◽  
Nicholas Scott Cardell

Mobile phone customers face many choices regarding handset hardware, add-on services, and features to subscribe to from their service providers. Mobile phone companies are now increasingly interested in the drivers of migration to third generation (3G) hardware and services. Using real world data provided to the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2006 Data Mining Competition, we explore the effectiveness of Friedman’s stochastic gradient boosting (Multiple Additive Regression Trees [MART]) for the rapid development of a high performance predictive model.



2021 ◽  
pp. 22-37
Author(s):  
Han Gao ◽  
Pei Shan Fam ◽  
Lea Tien Tay ◽  
Heng Chin Low

Tree-based gradient boosting (TGB) models gain popularity in various areas due to their powerful prediction ability and fast processing speed. This study aims to compare the landslide spatial prediction performance of TGB models and non-tree-based machine learning (NML) models in Penang Island, Malaysia. Two specific instances of TGB models, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM) and two specific instances of NML models, artificial neural network (ANN) and support vector machine (SVM), are applied to make predictions of landslide susceptibility. Feature selection and oversampling techniques are considered to improve the prediction performance as well. The results are analyzed and discussed mainly based on receiver operating characteristic (ROC) curves as well as the area under the curves (AUC). The results show that TGB models give better prediction performance compared to NML models, no matter what the sample size is. The TGB models’ performances are improved when training with the dataset considering either feature selection or oversampling techniques. The highest AUC value of 0.9525 is obtained from the combination of XGBoost and SMOTE. The landslide susceptibility maps (LSMs) produced by XGBoost and LightGBM can provide valuable information in landslide management and mitigation in Penang Island, Malaysia.



Symmetry ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 89 ◽  
Author(s):  
Tuong Le ◽  
Sung Baik

Recently, a standard dataset namely SCADI (Self-Care Activities Dataset) based on the International Classification of Functioning, Disability, and Health for Children and Youth framework for self-care problems identification of children with physical and motor disabilities was introduced. This is a very interesting, important and challenging topic due to its usefulness in medical diagnosis. This study proposes a robust framework using a sampling technique and extreme gradient boosting (FSX) to improve the prediction performance for the SCADI dataset. The proposed framework first converts the original dataset to a new dataset with a smaller number of dimensions. Then, our proposed framework balances the new dataset in the previous step using oversampling techniques with different ratios. Next, extreme gradient boosting was used to diagnose the problems. The experiments in terms of prediction performance and feature importance were conducted to show the effectiveness of FSX as well as to analyse the results. The experimental results show that FSX that uses the Synthetic Minority Over-sampling Technique (SMOTE) for the oversampling module outperforms the ANN (Artificial Neural Network) -based approach, Support vector machine (SVM) and Random Forest for the SCADI dataset. The overall accuracy of the proposed framework reaches 85.4%, a pretty high performance, which can be used for self-care problem classification in medical diagnosis.
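The oversampling step named in this abstract, SMOTE, creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea only. The function name `smote_sketch`, its parameters, and the sample points are hypothetical; this is not the authors' FSX implementation or any library's API.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic points from a list of minority samples.

    Requires at least two minority samples. Each synthetic point lies on
    the line segment between a random minority sample and one of its
    `k` nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6)
```

Synthetic points always fall between existing minority samples, so the minority class grows without simply duplicating rows.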



2021 ◽  
Author(s):  
Lei Feng ◽  
Xiangni Tian ◽  
Yousry A. El-Kassaby ◽  
Jian Qiu ◽  
Ze Feng ◽  
...  

Background: Melia azedarach L. is a globally distributed tree species of economic importance; however, it is unclear how the species distribution will respond to future climate changes. Methods: We aimed to select the most accurate of seven data mining models to predict the species' suitable contemporary and future habitats. These models are maximum entropy (MaxEnt), support vector machine (SVM), generalized linear model (GLM), random forest (RF), naive Bayesian model (NBM), extreme gradient boosting (XGBoost), and gradient boosting machine (GBM). A total of 906 M. azedarach locations were identified, and sixteen climate predictors were used for model building. The models’ validity was assessed using three measures (area under the curve (AUC), kappa, and accuracy). Results: We found that RF provided the most outstanding performance in prediction power and generalization capacity. The top climate factors affecting the species distribution were mean coldest month temperature (MCMT), followed by the number of frost-free days (NFFD), degree-days above 18°C (DD>18), temperature difference between MWMT and MCMT, or continentality (TD), mean annual precipitation (MAP), and degree-days below 18°C (DD<18). We projected that future suitable habitat of this species would increase under both the RCP4.5 and RCP8.5 scenarios for the 2020s, 2050s, and 2080s. Conclusion: Our findings are expected to assist in better understanding the impact of climate change on the species and provide a scientific basis for its planting and conservation.
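Two of the validity measures named in this abstract, accuracy and kappa, can be computed as in the sketch below. It assumes that "kappa" means Cohen's kappa, which the abstract does not state explicitly, and the presence/absence labels are hypothetical.

```python
from collections import Counter


def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists.

    Kappa is undefined when the chance agreement equals 1, for example
    when both lists contain a single identical class.
    """
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Agreement expected by chance from the marginal class frequencies.
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical labels: 1 = species present, 0 = species absent.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]
accuracy, kappa = accuracy_and_kappa(actual, predicted)
```

Here accuracy is 0.75 while kappa is 0.5, because half of the agreement would be expected by chance with balanced classes.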



Customer churn is one of the most difficult and important issues for large companies because of its direct impact on revenue. Especially in the telecom domain, companies are searching for advanced strategies to predict which customers will churn. This research focuses on constructing a predictive model that identifies each customer as a churner or non-churner and yields additional insights about service consumers. The main contribution is to overcome the limitation of relying solely on data mining strategies by deriving network metrics, such as centrality and connectivity between customers, and so incorporating network mining into traditional data mining. Social network measurements, e.g. leverage, flow betweenness, PageRank, clustering coefficient, and eccentricity, are joined with the other attributes in the original network dataset to enhance the performance of the proposed methodology. Churn risk is predicted by extensively cleaning the raw data for churn modeling, dividing customers into clusters based on Gower distance and the k-medoids algorithm to help understand and predict churners, building a classification model with Extreme Gradient Boosting (XGBoost), and assessing model performance after computing the centrality metrics as new attributes appended to the original network dataset. Experiments conducted on telecom data show an average accuracy across all statistics of not lower than 98.27%, while the average accuracy for the original dataset with its clusters does not exceed 0.97%.
The proposed churner detection method, which combines social impacts and network contents based on clustering, significantly improved prediction accuracy for the telecom dataset compared with prediction using call log details and network information without clustering. This validates the hypothesis that combining social network attributes with users' Call/SMS information can yield substantially improved customer churn prediction.
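The Gower distance used for clustering in this abstract handles mixed numeric and categorical attributes. The sketch below shows the standard unweighted form under stated assumptions: the function name `gower_distance`, the customer attributes, and the feature ranges are hypothetical, and the study's actual attribute set is not given here.

```python
def gower_distance(a, b, ranges):
    """Unweighted Gower distance between two mixed-type records.

    `ranges[i]` is the observed range (max - min) of numeric feature i,
    or None when feature i is categorical.
    """
    total = 0.0
    for x, y, feature_range in zip(a, b, ranges):
        if feature_range is None:
            # Categorical feature: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif feature_range > 0:
            # Numeric feature: absolute difference scaled to [0, 1].
            total += abs(x - y) / feature_range
    return total / len(ranges)


# Hypothetical customers: (monthly call minutes, SMS count, plan type).
customers = [
    (100, 10, "prepaid"),
    (300, 50, "postpaid"),
    (200, 30, "prepaid"),
]
ranges = [200, 40, None]  # ranges of the two numeric features

d01 = gower_distance(customers[0], customers[1], ranges)
d02 = gower_distance(customers[0], customers[2], ranges)
```

Every feature contributes a value between 0 and 1, so the resulting distances can be passed to k-medoids, which needs only pairwise dissimilarities rather than coordinates.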


