Customer churn analysis using XGBoosted decision trees

Author(s):  
Muthupriya Vasudevan ◽  
Revathi Sathya Narayanan ◽  
Sabiyath Fatima Nakeeb ◽  
Abhishek Abhishek

Customer relationship management (CRM) is an important element in all forms of industry. This process involves ensuring that the customers of a business are satisfied with the product or services that they are paying for. Since most businesses collect and store large volumes of data about their customers, it is easy for data analysts to use that data to perform predictive analysis. One aspect of this is customer retention and customer churn. Customer churn refers to whether or not a customer of the company will stop using the product or service in the future. In this paper, a supervised machine learning algorithm has been implemented in Python to perform customer churn analysis on a given dataset from Telco, a mobile telecommunication company. This is achieved by building a decision tree model based on historical data provided by the company on the Kaggle platform. This report also investigates the utility of the extreme gradient boosting (XGBoost) library, a gradient boosting framework for Python, whose portable and flexible functionality can be used to solve many data science problems highly efficiently. The implementation results show that XGBoost achieves better accuracy than the other learning models.
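Below is a minimal Python sketch of the kind of XGBoost churn classifier the abstract describes; the file name, target encoding and hyperparameters are illustrative assumptions, not the authors' actual setup.

```python
# Sketch: churn classification with XGBoost on a Telco-style dataset.
# "telco_churn.csv" and the "Churn" column are assumptions about the Kaggle data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

df = pd.read_csv("telco_churn.csv")              # hypothetical local copy of the Kaggle dataset
y = (df["Churn"] == "Yes").astype(int)           # encode the target as 0/1
X = pd.get_dummies(df.drop(columns=["Churn"]))   # one-hot encode categorical features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```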

Data science in healthcare is an innovative and promising field for industries implementing data science applications. Data analytics is a recent discipline used to explore medical datasets and discover disease. It is an early attempt to identify disease with the help of large amounts of medical data. Using this data science methodology, users can screen for disease without visiting a healthcare centre. Healthcare and data science are often linked through finances, as the industry attempts to reduce its expenses with the help of large amounts of data. Data science and medicine are rapidly developing, and it is important that they advance together. Healthcare information is very valuable to society. Heart disease has been increasing in everyday life. To analyse and prevent heart disease, different factors of the human body are monitored; classifying these factors using machine learning algorithms and predicting the disease is the major task. A major part of this involves supervised machine learning algorithms such as SVM, Naive Bayes, decision trees and random forest.
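A minimal sketch of the kind of comparison this abstract describes, assuming a tabular heart-disease CSV with a binary "target" column (both the path and column name are illustrative):

```python
# Sketch: compare SVM, Naive Bayes, decision tree and random forest
# with 5-fold cross-validated accuracy on a heart-disease dataset.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("heart.csv")                 # hypothetical heart-disease dataset
X, y = df.drop(columns=["target"]), df["target"]

models = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold accuracy
    print(f"{name}: {scores.mean():.3f}")
```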


2021 ◽  
Author(s):  
Domingos Andrade ◽  
Luigi Ribeiro ◽  
Agnaldo Lopes ◽  
Jorge Amaral ◽  
Pedro Lopes de Melo

Abstract
Background: The use of machine learning (ML) methods could improve the diagnosis of respiratory changes in systemic sclerosis (SSc). This paper evaluates the performance of several ML algorithms applied to respiratory oscillometry analysis to aid in the diagnosis of respiratory changes in SSc. We also determine the best configuration for this task.
Methods: Oscillometric and spirometric exams were performed in 82 individuals, including controls (n=30) and patients with systemic sclerosis with normal (n=22) and abnormal (n=30) spirometry. Multiple-instance classifiers and different supervised machine learning techniques were investigated, including k-nearest neighbours (KNN), random forests (RF), AdaBoost with decision trees (ADAB), and extreme gradient boosting (XGB).
Results and discussion: The first experiment of this study showed that the best oscillometric parameter (BOP) was dynamic compliance. In the scenario Control Group versus Patients with Sclerosis and Normal Spirometry (CGvsPSNS), it provided moderate accuracy (AUC=0.77). In the scenario Control Group versus Patients with Sclerosis and Altered Spirometry (CGvsPSAS), the BOP obtained high accuracy (AUC=0.94). In the second experiment, the ML techniques were used. In CGvsPSNS, KNN achieved the best result (AUC=0.90), significantly improving the accuracy in comparison with the BOP (p<0.01), while in CGvsPSAS, RF obtained the best results (AUC=0.97), also significantly improving the diagnostic accuracy (p<0.05). In the third, fourth, fifth, and sixth experiments, the use of different feature selection techniques allowed us to identify the best oscillometric parameters. They all showed a small increase in diagnostic accuracy in CGvsPSNS (0.87, 0.86, 0.82, and 0.84, respectively), while in CGvsPSAS, the performance of the best classifier remained the same (AUC=0.97).
Conclusions: Oscillometric principles combined with machine learning algorithms provide a new method for the diagnosis of respiratory changes in patients with systemic sclerosis. The findings of the present study provide evidence that this combination may play an important role in the early diagnosis of respiratory changes in these patients.
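A minimal sketch of this kind of classifier comparison (KNN, RF, AdaBoost, XGB scored by cross-validated AUC); synthetic data stands in for the oscillometric feature matrix, and all hyperparameters are illustrative:

```python
# Sketch: compare the four classifiers named above by 5-fold cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

# synthetic stand-in for the oscillometric features (control vs. patient)
X, y = make_classification(n_samples=82, n_features=8, n_informative=4,
                           random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
    "ADAB": AdaBoostClassifier(random_state=0),    # decision-tree stumps by default
    "XGB": XGBClassifier(n_estimators=300, max_depth=3, eval_metric="logloss"),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.2f} +/- {auc.std():.2f}")
```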


2020 ◽  
Vol 9 (3) ◽  
pp. 658 ◽  
Author(s):  
Jun-Cheng Weng ◽  
Tung-Yeh Lin ◽  
Yuan-Hsiung Tsai ◽  
Man Teng Cheok ◽  
Yi-Peng Eve Chang ◽  
...  

It is estimated that at least one million people die by suicide every year, showing the importance of suicide prevention and detection. In this study, an autoencoder and machine learning models were employed to predict people with suicidal ideation based on their structural brain imaging. The subjects in our generalized q-sampling imaging (GQI) dataset consisted of three groups: 41 depressive patients with suicidal ideation (SI), 54 depressive patients without suicidal thoughts (NS), and 58 healthy controls (HC). In the GQI dataset, indices of generalized fractional anisotropy (GFA), isotropic values of the orientation distribution function (ISO), and normalized quantitative anisotropy (NQA) were separately trained in different machine learning models. A convolutional neural network (CNN)-based autoencoder model, the supervised machine learning algorithm extreme gradient boosting (XGB), and logistic regression (LR) were used to discriminate SI subjects from NS and HC subjects. After five-fold cross-validation, a separate test set was used to obtain the accuracy, sensitivity, specificity, and area under the curve of each result. Our results showed that the best pattern of structure across multiple brain locations can distinguish suicidal ideators from NS and HC subjects with a prediction accuracy of 85%, a specificity of 100% and a sensitivity of 75%. The algorithms developed here might provide an objective tool to help identify suicidal ideation risk among depressed patients alongside clinical assessment.
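A minimal sketch of the XGB versus LR comparison with the evaluation protocol described above (five-fold cross-validation followed by a held-out test set scored with accuracy, sensitivity, specificity and AUC); synthetic features stand in for the GFA/ISO/NQA imaging indices:

```python
# Sketch: discriminate the SI group (label 1) from NS+HC (label 0).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=153, n_features=100, weights=[0.73],
                           random_state=0)       # stand-in imaging features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)

for name, clf in {"XGB": XGBClassifier(eval_metric="logloss"),
                  "LR": LogisticRegression(max_iter=1000)}.items():
    cv_acc = cross_val_score(clf, X_train, y_train, cv=5).mean()  # 5-fold CV
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)                                    # held-out test
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print(f"{name}: cv_acc={cv_acc:.2f} acc={accuracy_score(y_test, pred):.2f} "
          f"sens={tp/(tp+fn):.2f} spec={tn/(tn+fp):.2f} "
          f"auc={roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]):.2f}")
```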


2022 ◽  
Vol 17 (1) ◽  
pp. 165-198
Author(s):  
Kamil Matuszelański ◽  
Katarzyna Kopczewska

This study is a comprehensive and modern approach to predicting customer churn, using the example of an e-commerce retail store operating in Brazil. Our approach consists of three stages in which we combine and use three different datasets: numerical data on orders, textual after-purchase reviews and socio-geo-demographic data from the census. At the pre-processing stage, we find topics from text reviews using Latent Dirichlet Allocation, Dirichlet Multinomial Mixture and Gibbs sampling. In the spatial analysis, we apply DBSCAN to obtain rural/urban locations and analyse the neighbourhoods of customers geolocated by zip code. At the modelling stage, we apply machine learning methods: extreme gradient boosting and logistic regression. The quality of the models is verified with area under the curve and lift metrics. Explainable artificial intelligence, represented by permutation-based variable importance and a partial dependence profile, helps to discover the determinants of churn. We show that customers’ propensity to churn depends on: (i) payment value for the first order, number of items bought and shipping cost; (ii) categories of the products bought; (iii) demographic environment of the customer; and (iv) customer location. At the same time, customers’ propensity to churn is not influenced by: (i) population density in the customer’s area and division into rural and urban areas; (ii) quantitative review of the first purchase; and (iii) qualitative review summarised as a topic.
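A minimal sketch of the explainability step described above: an XGBoost churn model inspected with permutation-based variable importance and a partial dependence profile via scikit-learn's inspection module (≥ 1.0). The feature names are hypothetical stand-ins for the study's predictors, and the data are synthetic:

```python
# Sketch: permutation importance and partial dependence for a churn model.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X = pd.DataFrame(X, columns=["payment_value", "n_items", "shipping_cost",
                             "population_density"])   # hypothetical predictors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X_tr, y_tr)

# permutation-based variable importance, scored by AUC on the test split
imp = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                             n_repeats=10, random_state=0)
for name, score in zip(X.columns, imp.importances_mean):
    print(f"{name}: {score:.3f}")

# partial dependence of predicted churn on the first-order payment value
PartialDependenceDisplay.from_estimator(model, X_te, ["payment_value"])
plt.savefig("pdp_payment_value.png")
```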


2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
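A minimal sketch of the "Rule as Feature" idea described above: the output of a hand-crafted rule is appended to the feature vector consumed by a logistic-regression classifier. The rule, features and labels here are illustrative stand-ins rather than the authors' actual report attributes:

```python
# Sketch: compare an ML-only model with a hybrid Rule-as-Feature model by F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def rule_prediction(x):
    # stand-in for a hand-crafted rule applied to a report segment
    return 1 if x[0] > 0.5 else 0

rule_feature = np.array([[rule_prediction(x)] for x in X])
X_hybrid = np.hstack([X, rule_feature])          # rule output as an extra feature

for name, features in {"ML only": X, "Rule as Feature": X_hybrid}.items():
    f1 = cross_val_score(LogisticRegression(max_iter=1000), features, y,
                         cv=5, scoring="f1").mean()
    print(f"{name}: F1 = {f1:.3f}")
```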


2018 ◽  
Vol 7 (04) ◽  
pp. 871-888 ◽  
Author(s):  
Sophie J. Lee ◽  
Howard Liu ◽  
Michael D. Ward

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that classifies each location mention as either correct or incorrect. We extract contextual information from texts, i.e., N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.
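A minimal sketch of the second-stage mention classifier described above: the sentence context around each location mention is turned into n-gram features and a supervised model labels the mention correct or incorrect. The training examples and model choice (TF-IDF plus logistic regression) are illustrative stand-ins, not the authors' exact pipeline:

```python
# Sketch: classify whether a location mention refers to the event location.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

contexts = [
    "clashes erupted in Aleppo on Tuesday",          # mention is the event location
    "the spokesperson, speaking from Geneva, said",  # mention is not
]
labels = [1, 0]

clf = Pipeline([
    ("ngrams", TfidfVectorizer(ngram_range=(1, 2))),   # word uni- and bi-grams
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(contexts, labels)
print(clf.predict(["fighting continued in Mosul overnight"]))
```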


Author(s):  
Saifur Rahman ◽  
Muhammad Irfan ◽  
Mohsin Raza ◽  
Khawaja Moyeezullah Ghori ◽  
Shumayla Yaqoob ◽  
...  

Physical activity is essential for physical and mental health, and its absence is highly associated with severe health conditions and disorders. Therefore, tracking activities of daily living can help promote quality of life. Wearable sensors in this regard can provide a reliable and economical means of tracking such activities, and such sensors are readily available in smartphones and watches. This study is the first of its kind to develop a wearable sensor-based physical activity classification system using a special class of supervised machine learning approaches called boosting algorithms. The study presents a performance analysis of several boosting algorithms (extreme gradient boosting—XGB, light gradient boosting machine—LGBM, gradient boosting—GB, cat boosting—CB and AdaBoost) in a fair and unbiased way, using a uniform dataset, feature set, feature selection method, performance metric and cross-validation technique. The study utilizes a smartphone-based dataset of thirty individuals. The results showed that the proposed method could accurately classify the activities of daily living with very high performance (above 90%). These findings suggest the strength of the proposed system in classifying activities of daily living using only the smartphone sensor’s data, and can assist in reducing physical inactivity to promote a healthier lifestyle and wellbeing.
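A minimal sketch of a uniform comparison of the five boosting algorithms named above, each evaluated with the same cross-validation protocol; synthetic data stands in for the smartphone sensor features, and the hyperparameters are the library defaults:

```python
# Sketch: compare XGB, LGBM, GB, CatBoost and AdaBoost on one activity dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           n_classes=6, random_state=0)   # six activity classes

models = {
    "XGB": XGBClassifier(eval_metric="mlogloss"),
    "LGBM": LGBMClassifier(),
    "GB": GradientBoostingClassifier(),
    "CB": CatBoostClassifier(verbose=0),
    "AdaBoost": AdaBoostClassifier(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```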


2011 ◽  
Vol 19 (4) ◽  
pp. 409-433 ◽  
Author(s):  
Francisco Cantú ◽  
Sebastián M. Saiegh

In this paper, we introduce an innovative method to diagnose electoral fraud using vote counts. Specifically, we use synthetic data to develop and train a fraud detection prototype. We employ a naive Bayes classifier as our learning algorithm and rely on digital analysis to identify the features that are most informative about class distinctions. To evaluate the detection capability of the classifier, we use authentic data drawn from a novel data set of district-level vote counts in the province of Buenos Aires (Argentina) between 1931 and 1941, a period with a checkered history of fraud. Our results corroborate the validity of our approach: The elections considered to be irregular (legitimate) by most historical accounts are unambiguously classified as fraudulent (clean) by the learner. More generally, our findings demonstrate the feasibility of generating and using synthetic data for training and testing an electoral fraud detection system.
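A minimal sketch of a digit-based fraud classifier in the spirit of the approach described above: the distribution of last digits of district-level vote counts is used as features for a naive Bayes learner trained on synthetic "clean" and "fraudulent" elections. The digit feature, the rounding-based fraud pattern and the labels are all illustrative stand-ins:

```python
# Sketch: naive Bayes over last-digit frequency features of vote counts.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def digit_features(counts):
    # share of each last digit (0-9) across the districts of one election
    last = counts % 10
    return np.bincount(last, minlength=10) / len(counts)

clean = [digit_features(rng.integers(100, 5000, size=200)) for _ in range(50)]
fraud = [digit_features(rng.integers(100, 5000, size=200) // 10 * 10)  # rounded counts
         for _ in range(50)]
X = np.vstack(clean + fraud)
y = np.array([0] * 50 + [1] * 50)     # 0 = clean, 1 = fraudulent

clf = GaussianNB().fit(X, y)
print(clf.predict(X[:3]), clf.predict(X[-3:]))
```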


Water ◽  
2021 ◽  
Vol 13 (19) ◽  
pp. 2633
Author(s):  
Jie Yu ◽  
Yitong Cao ◽  
Fei Shi ◽  
Jiegen Shi ◽  
Dibo Hou ◽  
...  

Three-dimensional fluorescence spectroscopy has become increasingly useful in the detection of organic pollutants. However, this approach is limited by decreased accuracy in identifying low-concentration pollutants. In this research, a new identification method for organic pollutants in drinking water is accordingly proposed using three-dimensional fluorescence spectroscopy data and a deep learning algorithm. A novel application of a convolutional autoencoder was designed to process high-dimensional fluorescence data and extract multi-scale features from the spectrum of drinking water samples containing organic pollutants. Extreme Gradient Boosting (XGBoost), an implementation of gradient-boosted decision trees, was used to identify the organic pollutants based on the obtained features. The method's identification performance was validated on three typical organic pollutants at different concentrations for the scenario of accidental pollution. Results showed that the proposed method achieved improved accuracy for both high- (>10 μg/L) and low- (≤10 μg/L) concentration pollutant samples. Compared to traditional spectrum processing techniques, the convolutional autoencoder-based approach enabled obtaining more detailed features from fluorescence spectral data. Moreover, evidence indicated that the proposed method maintained its detection ability under conditions in which the background water changes. It can effectively reduce the rate of misjudgments associated with fluctuations in drinking water quality. This study demonstrates the possibility of using deep learning algorithms for spectral processing and contamination detection in drinking water.
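A minimal sketch of this kind of pipeline: a convolutional autoencoder compresses fluorescence matrices and XGBoost classifies pollutants from the learned features. The framework (Keras), data shapes, layer sizes and the random arrays standing in for real spectra are all assumptions made for illustration:

```python
# Sketch: convolutional autoencoder features feeding an XGBoost classifier.
import numpy as np
from tensorflow.keras import layers, Model
from xgboost import XGBClassifier

n_samples, h, w = 200, 64, 64
spectra = np.random.rand(n_samples, h, w, 1).astype("float32")  # stand-in spectra
labels = np.random.randint(0, 3, n_samples)                     # three pollutant types

inp = layers.Input(shape=(h, w, 1))
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D()(x)                               # bottleneck features
x = layers.Conv2DTranspose(8, 3, strides=2, activation="relu", padding="same")(encoded)
x = layers.Conv2DTranspose(16, 3, strides=2, activation="relu", padding="same")(x)
out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(spectra, spectra, epochs=5, batch_size=32, verbose=0)

encoder = Model(inp, layers.Flatten()(encoded))                  # reuse trained encoder
features = encoder.predict(spectra, verbose=0)
clf = XGBClassifier(eval_metric="mlogloss").fit(features, labels)
print(clf.score(features, labels))
```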


West Nile Virus (WNV) is a mosquito-borne disease in which humans are infected through mosquito bites. The disease is considered a serious threat to society, especially in the United States, where it is frequently found in localities with water bodies. The traditional approach is to collect mosquito traps from a locality and check whether the mosquitoes are infected with the virus. If the virus is found, that locality is sprayed with pesticides. However, this process is very time-consuming and requires substantial financial support. Machine learning methods can provide an efficient approach to predicting the presence of the virus in a locality using data related to its location and weather. This paper uses a Kaggle dataset that includes information about the traps found in each locality as well as the locality's weather. The dataset is imbalanced, hence the Synthetic Minority Oversampling Technique (SMOTE), an upsampling method, is used to balance it. Ensemble learning classifiers such as random forest, gradient boosting and Extreme Gradient Boosting (XGB) are then trained. The performance of the ensemble classifiers is compared with the performance of the best supervised learning algorithm, SVM. Among the models, XGB gave the highest F1 score of 92.93, performing marginally better than random forest (92.78) and SVM (91.16).
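A minimal sketch of this pipeline: SMOTE upsampling of the minority (virus-present) class followed by ensemble classifiers and an SVM baseline, compared by F1 score. Synthetic data stands in for the Kaggle trap/weather features, and all model settings are the library defaults:

```python
# Sketch: SMOTE-balanced training, then ensemble classifiers vs. SVM by F1.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=0)       # imbalanced: ~5% positive traps
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # balance training set only

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "XGB": XGBClassifier(eval_metric="logloss"),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_res, y_res)
    print(f"{name}: F1 = {f1_score(y_te, model.predict(X_te)):.3f}")
```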

