scholarly journals Augmented Data and XGBoost Improvement for Sales Forecasting in the Large-Scale Retail Sector

2021 ◽  
Vol 11 (17) ◽  
pp. 7793
Author(s):  
Alessandro Massaro ◽  
Antonio Panarese ◽  
Daniele Giannone ◽  
Angelo Galiano

The organized large-scale retail sector has been gradually establishing itself around the world, and has increased activities exponentially in the pandemic period. This modern sales system uses Data Mining technologies processing precious information to increase profit. In this direction, the extreme gradient boosting (XGBoost) algorithm was applied in an industrial project as a supervised learning algorithm to predict product sales including promotion condition and a multiparametric analysis. The implemented XGBoost model was trained and tested by the use of the Augmented Data (AD) technique in the event that the available data are not sufficient to achieve the desired accuracy, as for many practical cases of artificial intelligence data processing, where a large dataset is not available. The prediction was applied to a grid of segmented customers by allowing personalized services according to their purchasing behavior. The AD technique conferred a good accuracy if compared with results adopting the initial dataset with few records. An improvement of the prediction error, such as the Root Mean Square Error (RMSE) and Mean Square Error (MSE), which decreases by about an order of magnitude, was achieved. The AD technique formulated for large-scale retail sector also represents a good way to calibrate the training model.

Symmetry ◽  
2020 ◽  
Vol 12 (9) ◽  
pp. 1566 ◽  
Author(s):  
Zeinab Shahbazi ◽  
Debapriya Hazra ◽  
Sejoon Park ◽  
Yung Cheol Byun

With the spread of COVID-19, the “untact” culture in South Korea is expanding and customers are increasingly seeking for online services. A recommendation system serves as a decision-making indicator that helps users by suggesting items to be purchased in the future by exploring the symmetry between multiple user activity characteristics. A plethora of approaches are employed by the scientific community to design recommendation systems, including collaborative filtering, stereotyping, and content-based filtering, etc. The current paradigm of recommendation systems favors collaborative filtering due to its significant potential to closely capture the interest of a user as compared to other approaches. The collaborative filtering harnesses features like user-profile details, visited pages, and click information to determine the interest of a user, thereby recommending the items that are related to the user’s interest. The existing collaborative filtering approaches exploit implicit and explicit features and report either good classification or prediction outcome. These systems fail to exhibit good results for both measures at the same time. We believe that avoiding the recommendation of those items that have already been purchased could contribute to overcoming the said issue. In this study, we present a collaborative filtering-based algorithm to tackle big data of user with symmetric purchasing order and repetitive purchased products. The proposed algorithm relies on combining extreme gradient boosting machine learning architecture with word2vec mechanism to explore the purchased products based on the click patterns of users. Our algorithm improves the accuracy of predicting the relevant products to be recommended to the customers that are likely to be bought. The results are evaluated on the dataset that contains click-based features of users from an online shopping mall in Jeju Island, South Korea. We have evaluated Mean Absolute Error, Mean Square Error, and Root Mean Square Error for our proposed methodology and also other machine learning algorithms. Our proposed model generated the least error rate and enhanced the prediction accuracy of the recommendation system compared to other traditional approaches.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Jiuzhou Jiang ◽  
Hao Pan ◽  
Mobai Li ◽  
Bao Qian ◽  
Xianfeng Lin ◽  
...  

AbstractOsteosarcoma is the most common bone malignancy, with the highest incidence in children and adolescents. Survival rate prediction is important for improving prognosis and planning therapy. However, there is still no prediction model with a high accuracy rate for osteosarcoma. Therefore, we aimed to construct an artificial intelligence (AI) model for predicting the 5-year survival of osteosarcoma patients by using extreme gradient boosting (XGBoost), a large-scale machine-learning algorithm. We identified cases of osteosarcoma in the Surveillance, Epidemiology, and End Results (SEER) Research Database and excluded substandard samples. The study population was 835 and was divided into the training set (n = 668) and validation set (n = 167). Characteristics selected via survival analyses were used to construct the model. Receiver operating characteristic (ROC) curve and decision curve analyses were performed to evaluate the prediction. The accuracy of the prediction model was excellent both in the training set (area under the ROC curve [AUC] = 0.977) and the validation set (AUC = 0.911). Decision curve analyses proved the model could be used to support clinical decisions. XGBoost is an effective algorithm for predicting 5-year survival of osteosarcoma patients. Our prediction model had excellent accuracy and is therefore useful in clinical settings.


2020 ◽  
Vol 39 (5) ◽  
pp. 7605-7620 ◽  
Author(s):  
Dhivya Elavarasan ◽  
Durai Raj Vincent

The development in science and technical intelligence has incited to represent an extensive amount ofdata from various fields of agriculture. Therefore an objective rises up for the examination of the available data and integrating with processes like crop enhancement, yield prediction, examination of plant infections etc. Machine learning has up surged with tremendous processing techniques to perceive new contingencies in the multi-disciplinary agrarian advancements. In this pa- per a novel hybrid regression algorithm, reinforced extreme gradient boosting is proposed which displays essentially improved execution over traditional machine learning algorithms like artificial neural networks, deep Q-Network, gradient boosting, ran- dom forest and decision tree. Extreme gradient boosting constructs new models, which are essentially, decision trees learning from the mistakes of their predecessors by optimizing the gradient descent loss function. The proposed hybrid model performs reinforcement learning at every node during the node splitting process of the decision tree construction. This leads to effective utilizationofthesamplesbyselectingtheappropriatesplitattributeforenhancedperformance. Model’sperformanceisevaluated by means of Mean Square Error, Root Mean Square Error, Mean Absolute Error, and Coefficient of Determination. To assure a fair assessment of the results, the model assessment is performed on both training and test dataset. The regression diagnostic plots from residuals and the results obtained evidently delineates the fact that proposed hybrid approach performs better with reduced error measure and improved accuracy of 94.15% over the other machine learning algorithms. Also the performance of probability density function for the proposed model delineates that, it can preserve the actual distributional characteristics of the original crop yield data more approximately when compared to the other experimented machine learning models.


2019 ◽  
Vol 11 (13) ◽  
pp. 1598 ◽  
Author(s):  
Hua Su ◽  
Xin Yang ◽  
Wenfang Lu ◽  
Xiao-Hai Yan

Retrieving multi-temporal and large-scale thermohaline structure information of the interior of the global ocean based on surface satellite observations is important for understanding the complex and multidimensional dynamic processes within the ocean. This study proposes a new ensemble learning algorithm, extreme gradient boosting (XGBoost), for retrieving subsurface thermohaline anomalies, including the subsurface temperature anomaly (STA) and the subsurface salinity anomaly (SSA), in the upper 2000 m of the global ocean. The model combines surface satellite observations and in situ Argo data for estimation, and uses root-mean-square error (RMSE), normalized root-mean-square error (NRMSE), and R2 as accuracy evaluations. The results show that the proposed XGBoost model can easily retrieve subsurface thermohaline anomalies and outperforms the gradient boosting decision tree (GBDT) model. The XGBoost model had good performance with average R2 values of 0.69 and 0.54, and average NRMSE values of 0.035 and 0.042, for STA and SSA estimations, respectively. The thermohaline anomaly patterns presented obvious seasonal variation signals in the upper layers (the upper 500 m); however, these signals became weaker as the depth increased. The model performance fluctuated, with the best performance in October (autumn) for both STA and SSA, and the lowest accuracy occurred in January (winter) for STA and April (spring) for SSA. The STA estimation error mainly occurred in the El Niño-Southern Oscillation (ENSO) region in the upper ocean and the boundary of the ocean basins in the deeper ocean; meanwhile, the SSA estimation error presented a relatively even distribution. The wind speed anomalies, including the u and v components, contributed more to the XGBoost model for both STA and SSA estimations than the other surface parameters; however, its importance at deeper layers decreased and the contributions of the other parameters increased. This study provides an effective remote sensing technique for subsurface thermohaline estimations and further promotes long-term remote sensing reconstructions of internal ocean parameters.


2019 ◽  
Vol 11 (12) ◽  
pp. 1505 ◽  
Author(s):  
Heng Zhang ◽  
Anwar Eziz ◽  
Jian Xiao ◽  
Shengli Tao ◽  
Shaopeng Wang ◽  
...  

Accurate mapping of vegetation is a premise for conserving, managing, and sustainably using vegetation resources, especially in conditions of intensive human activities and accelerating global changes. However, it is still challenging to produce high-resolution multiclass vegetation map in high accuracy, due to the incapacity of traditional mapping techniques in distinguishing mosaic vegetation classes with subtle differences and the paucity of fieldwork data. This study created a workflow by adopting a promising classifier, extreme gradient boosting (XGBoost), to produce accurate vegetation maps of two strikingly different cases (the Dzungarian Basin in China and New Zealand) based on extensive features and abundant vegetation data. For the Dzungarian Basin, a vegetation map with seven vegetation types, 17 subtypes, and 43 associations was produced with an overall accuracy of 0.907, 0.801, and 0.748, respectively. For New Zealand, a map of 10 habitats and a map of 41 vegetation classes were produced with 0.946, and 0.703 overall accuracy, respectively. The workflow incorporating simplified field survey procedures outperformed conventional field survey and remote sensing based methods in terms of accuracy and efficiency. In addition, it opens a possibility of building large-scale, high-resolution, and timely vegetation monitoring platforms for most terrestrial ecosystems worldwide with the aid of Google Earth Engine and citizen science programs.


2020 ◽  
Vol 21 (S13) ◽  
Author(s):  
Ke Li ◽  
Sijia Zhang ◽  
Di Yan ◽  
Yannan Bin ◽  
Junfeng Xia

Abstract Background Identification of hot spots in protein-DNA interfaces provides crucial information for the research on protein-DNA interaction and drug design. As experimental methods for determining hot spots are time-consuming, labor-intensive and expensive, there is a need for developing reliable computational method to predict hot spots on a large scale. Results Here, we proposed a new method named sxPDH based on supervised isometric feature mapping (S-ISOMAP) and extreme gradient boosting (XGBoost) to predict hot spots in protein-DNA complexes. We obtained 114 features from a combination of the protein sequence, structure, network and solvent accessible information, and systematically assessed various feature selection methods and feature dimensionality reduction methods based on manifold learning. The results show that the S-ISOMAP method is superior to other feature selection or manifold learning methods. XGBoost was then used to develop hot spots prediction model sxPDH based on the three dimensionality-reduced features obtained from S-ISOMAP. Conclusion Our method sxPDH boosts prediction performance using S-ISOMAP and XGBoost. The AUC of the model is 0.773, and the F1 score is 0.713. Experimental results on benchmark dataset indicate that sxPDH can achieve generally better performance in predicting hot spots compared to the state-of-the-art methods.


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Mingyue Xue ◽  
Yinxia Su ◽  
Chen Li ◽  
Shuxia Wang ◽  
Hua Yao

Background. An estimated 425 million people globally have diabetes, accounting for 12% of the world’s health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in those remote, underserved areas. Methods. A total of 584,168 adult subjects who have participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratio, using logistic regression (LR) based on variables of physical measurement and a questionnaire. Combined with the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output the degree of variables’ importance scores of T2DM. Results. The results indicated that XGBoost had the best performance (accuracy=0.906, precision=0.910, recall=0.902, F‐1=0.906, and AUC=0.968). The degree of variables’ importance scores in XGBoost showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). Conclusions. We proposed a classifier based on LR-XGBoost which used fourteen variables of patients which are easily obtained and noninvasive as predictor variables to identify potential incidents of T2DM. The classifier can accurately screen the risk of diabetes in the early phrase, and the degree of variables’ importance scores gives a clue to prevent diabetes occurrence.


2018 ◽  
Vol 12 (2) ◽  
pp. 85-98 ◽  
Author(s):  
Barry E King ◽  
Jennifer L Rice ◽  
Julie Vaughan

Research predicting National Hockey League average attendance is presented. The seasons examined are the 2013 hockey season through the beginning of the 2017 hockey season. Multiple linear regression and three machine learning algorithms – random forest, M5 prime, and extreme gradient boosting – are employed to predict out-of-sample average home game attendance. Extreme gradient boosting generated the lowest out-of-sample root mean square error.  The team identifier (team name), the number of Twitter followers (a surrogate for team popularity), median ticket price, and arena capacity have appeared as the top four predictor variables. 


2020 ◽  
Author(s):  
Frank Kreuwel ◽  
Chiel van Heerwaarden

<p>Variability of solar irradiance is an important factor concerning large-scale integration of solar photovoltaics (PV) systems onto the electricity grid. Calculations of irradiance are computationally expensive, leaving operational meso-scale forecasting models struggling to achieve accurate results. Moreover, such models deliver outputs at a temporal resolution in the order of hours, whereas from a grid-integration point of view, minute-to-minute variability is a major concern. In previous work, we found that absolute power peaks in the order of seconds are up to 18% higher compared to 15-minute resolution for irradiance and even upwards of 22% higher for household PV systems. Moreover, these maximum peaks in output power are solely observed under mixed-cloud conditions, for which alse the greatest variability is found. In this work we present a machine-learning model which can forecast sub-resolution variability of irradiance, based on standard meso-scale outputs of the HARMONIE model of the The Royal Netherlands Meteorological Institute (KNMI). For training and validation, irradiance measurements obtained at a 1-second interval are used of the Baseline Surface Radiation Network (BSRN) site of Cabauw. A tree-based model was employed, for which the optimum members were constructed using extreme gradient boosting. In this work, we explore the dominant features of the model and link the machine-learned-relations to meteorological processes and dynamics. This research was executed in collaboration with the Distribution Grid Operator Alliander.</p>


2021 ◽  
Vol 10 (10) ◽  
pp. 680
Author(s):  
Annan Yang ◽  
Chunmei Wang ◽  
Guowei Pang ◽  
Yongqing Long ◽  
Lei Wang ◽  
...  

Gully erosion is the most severe type of water erosion and is a major land degradation process. Gully erosion susceptibility mapping (GESM)’s efficiency and interpretability remains a challenge, especially in complex terrain areas. In this study, a WoE-MLC model was used to solve the above problem, which combines machine learning classification algorithms and the statistical weight of evidence (WoE) model in the Loess Plateau. The three machine learning (ML) algorithms utilized in this research were random forest (RF), gradient boosted decision trees (GBDT), and extreme gradient boosting (XGBoost). The results showed that: (1) GESM were well predicted by combining both machine learning regression models and WoE-MLC models, with the area under the curve (AUC) values both greater than 0.92, and the latter was more computationally efficient and interpretable; (2) The XGBoost algorithm was more efficient in GESM than the other two algorithms, with the strongest generalization ability and best performance in avoiding overfitting (averaged AUC = 0.947), followed by the RF algorithm (averaged AUC = 0.944), and GBDT algorithm (averaged AUC = 0.938); and (3) slope gradient, land use, and altitude were the main factors for GESM. This study may provide a possible method for gully erosion susceptibility mapping at large scale.


Sign in / Sign up

Export Citation Format

Share Document