The benefits of segmentation: Evidence from a South African bank and other studies

2017, Vol 113 (9/10)
Author(s): Douw G. Breed, Tanja Verster

To demonstrate the benefit of segmentation in linear predictive modelling, we applied different modelling techniques to six data sets from different disciplines in industry on which predictive models can be developed. We first segmented the data (using unsupervised, semi-supervised and supervised methods) and then fitted a linear modelling technique, comparing the resulting model performance to that of popular non-linear modelling techniques. A total of eight modelling techniques were compared. We show that no single modelling technique always outperforms the others on these data sets; depending on the characteristics of the data set, one technique may outperform another. Specifically, on the direct marketing data set from a local South African bank, gradient boosting performed best. We also show that segmenting the data improves the performance of the linear modelling technique in the predictive modelling context on all data sets considered. Of the three segmentation methods considered, semi-supervised segmentation appears the most promising.
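
As a concrete illustration of the segment-then-fit approach, here is a minimal sketch: unsupervised segmentation (k-means) followed by one logistic regression per segment, built with scikit-learn on a synthetic stand-in data set, since the bank's data are not public. The cluster count and features are assumptions for illustration only.

```python
# Sketch: unsupervised segmentation followed by per-segment linear models.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Unsupervised segmentation: cluster on the independent variables only.
segmenter = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)
models = {}
for seg in range(segmenter.n_clusters):
    mask = segmenter.labels_ == seg
    models[seg] = LogisticRegression(max_iter=1000).fit(X_tr[mask], y_tr[mask])

# Score each validation case with its own segment's linear model.
va_seg = segmenter.predict(X_va)
p = np.array([models[s].predict_proba(x.reshape(1, -1))[0, 1]
              for s, x in zip(va_seg, X_va)])
print("Segmented logistic regression AUC:", roc_auc_score(y_va, p))
```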

2019, Vol 115 (3/4)
Author(s): Douw G. Breed, Tanja Verster

Segmentation of data for the purpose of enhancing predictive modelling is a well-established practice in the banking industry. Unsupervised and supervised approaches are the two main types of segmentation, and examples of improved predictive model performance exist for both. However, each focuses on a single aspect of the data (either target separation or the distribution of the independent variables), and combining them may deliver better results. This combined approach is called semi-supervised segmentation. Our objective was to explore four new semi-supervised segmentation techniques that may offer alternative strengths. We applied these techniques to six data sets from different domains and compared the model performance achieved. The original semi-supervised segmentation technique was the best for two of the data sets (as measured by the improvement in validation set Gini), but the new techniques outperformed it on the other four. Significance: We propose four newly developed semi-supervised segmentation techniques that can be used as additional tools for segmenting data before fitting a logistic regression. In all comparisons, applying semi-supervised segmentation before fitting a logistic regression improved the modelling performance (as measured by the Gini coefficient on the validation data set) relative to an unsegmented logistic regression.
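
For reference, the Gini coefficient used here relates to the AUC as Gini = 2·AUC − 1. A minimal sketch of computing the validation-set Gini for the unsegmented logistic regression baseline, on synthetic stand-in data:

```python
# Sketch: validation-set Gini (= 2*AUC - 1), the metric used to compare
# segmented against unsegmented logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def gini(y_true, y_score):
    """Gini coefficient computed from the ROC AUC."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0

X, y = make_classification(n_samples=4000, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Unsegmented Gini:", gini(y_va, lr.predict_proba(X_va)[:, 1]))
```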


2017, Vol 10 (2), pp. 695-708
Author(s): Simon Ruske, David O. Topping, Virginia E. Foot, Paul H. Kaye, Warren R. Stanley, et al.

Abstract. Characterisation of bioaerosols has important implications within the environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen.

This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real-world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification.

For unsupervised learning we tested hierarchical agglomerative clustering with various linkages. For supervised learning, 11 methods were tested: decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations of support vector machines (libsvm and liblinear), Gaussian methods (Gaussian naïve Bayes, quadratic and linear discriminant analysis), the k-nearest neighbours algorithm and artificial neural networks.

The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol.

Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67.6 % and 91.1 % for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82.8 % and 98.27 % of the testing data, respectively, across the two data sets.

A possible alternative to gradient boosting is neural networks. We do, however, note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially on parallelised hardware such as GPUs, which would allow larger networks to be trained and could possibly yield better results.

We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.
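
As a rough illustration of how such a comparison can be set up, here is a sketch using scikit-learn on the Iris data as a stand-in (the MBS measurements are not public): agglomerative clustering is scored by mapping each cluster to its majority class, while gradient boosting is scored on a held-out split.

```python
# Sketch: unsupervised vs. supervised classification on a stand-in data set.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Unsupervised: agglomerative clustering, scored by majority-class mapping.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
mapped = np.zeros_like(labels)
for c in np.unique(labels):
    mapped[labels == c] = np.bincount(y[labels == c]).argmax()
print("Clustering accuracy:", (mapped == y).mean())

# Supervised: gradient boosting evaluated on a held-out test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
gbm = GradientBoostingClassifier().fit(X_tr, y_tr)
print("Gradient boosting accuracy:", gbm.score(X_te, y_te))
```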


2017, Vol 2017, pp. 1-8
Author(s): Janek Thomas, Tobias Hepp, Andreas Mayr, Bernd Bischl

We present a new variable selection method based on model-based gradient boosting and randomly permuted variables. Model-based boosting is a tool to fit a statistical model while performing variable selection at the same time. A drawback of the fitting process lies in the need for multiple model fits on slightly altered data (e.g., cross-validation or bootstrap) to find the optimal number of boosting iterations and prevent overfitting. In our proposed approach, we augment the data set with randomly permuted versions of the true variables, so-called shadow variables, and stop the stepwise fitting as soon as such a variable would be added to the model. This allows variable selection in a single fit of the model without requiring further parameter tuning. We show that our probing approach can compete with state-of-the-art selection methods like stability selection in a high-dimensional classification benchmark, and we apply it to three gene expression data sets.
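
The probing idea lends itself to a compact sketch. Below is a simplified stand-in, assuming component-wise L2 boosting on synthetic data rather than the authors' mboost-style model-based boosting: each column is duplicated in row-permuted form, and fitting stops the first time a permuted (shadow) column would be selected.

```python
# Sketch: shadow-variable probing with component-wise L2 boosting.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - X[:, 3] + 0.5 * rng.standard_normal(n)

shadows = rng.permuted(X, axis=0)   # each column permuted independently
Z = np.hstack([X, shadows])         # columns p..2p-1 are shadow variables
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

resid = y - y.mean()
nu, selected = 0.1, set()
for _ in range(1000):
    scores = Z.T @ resid            # inner product of each column with residual
    j = int(np.argmax(np.abs(scores)))
    if j >= p:                      # best update is a shadow variable: stop
        break
    resid -= nu * (scores[j] / n) * Z[:, j]   # small step along column j
    selected.add(j)

print("Selected variables:", sorted(selected))
```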


Author(s): Aki Koivu, Mikko Sairanen

Abstract. Modelling the risk of abnormal pregnancy-related outcomes such as stillbirth and preterm birth has been proposed in the past. Commonly these models utilize maternal demographic and medical history information as predictors, and they are based on conventional statistical modelling techniques. In this study, we utilize state-of-the-art machine learning methods for the task of predicting early stillbirth, late stillbirth and preterm birth pregnancies. The aim of this experimentation is to discover novel risk models that could be utilized in a clinical setting. A CDC data set of almost sixteen million observations was used to conduct feature selection, parameter optimization and verification of the proposed models. An additional NYC data set was used for external validation. Algorithms such as logistic regression, artificial neural networks and gradient boosted decision trees were used to construct individual classifiers. Ensemble learning strategies over these classifiers were also experimented with. The best performing machine learning models achieved an AUC of 0.76 for early stillbirth, 0.63 for late stillbirth and 0.64 for preterm birth on the external NYC test data. The repeatable performance of our models demonstrates the robustness that is required in this context. Our proposed novel models provide a solid foundation for risk prediction and could be further improved with the addition of biochemical and/or biophysical markers.
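
A sketch of the general recipe (not the authors' exact feature set or tuning) follows: the three classifier families combined by soft voting and scored with AUC, on a synthetic, imbalanced stand-in for the CDC data.

```python
# Sketch: soft-voting ensemble of LR, a small NN and gradient boosted trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Rare positive outcome, mimicking the imbalance of adverse birth outcomes.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.97],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)),
                ("gbdt", GradientBoostingClassifier())],
    voting="soft")
ensemble.fit(X_tr, y_tr)
print("Ensemble AUC:", roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1]))
```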


2013, Vol 17 (11), pp. 4323-4337
Author(s): M. A. Sunyer, H. J. D. Sørup, O. B. Christensen, H. Madsen, D. Rosbjerg, et al.

Abstract. In recent years, there has been an increase in the number of climate studies addressing changes in extreme precipitation. A common step in these studies involves the assessment of climate model performance, often measured by comparing climate model output with observational data. In the majority of such studies the characteristics and uncertainties of the observational data are neglected. This study addresses the influence of using different observational data sets to assess climate model performance. Four different data sets covering Denmark, using different gauge systems and comprising both networks of point measurements and gridded data sets, are considered. Additionally, the influence of using different performance indices and metrics is addressed. A set of indices ranging from mean to extreme precipitation properties is calculated for all the data sets. For each of the observational data sets, the regional climate models (RCMs) are ranked according to their performance using two different metrics, based on the error in representing the indices and the spatial pattern. Compared with mean precipitation indices, extreme precipitation indices are highly dependent on the spatial resolution of the observations. The spatial pattern also shows differences between the observational data sets. These differences have a clear impact on the ranking of the climate models, which is highly dependent on the observational data set, the index and the metric used. The results highlight the need to be aware of the properties of the observational data chosen, in order to avoid overconfident and misleading conclusions with respect to climate model performance.
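
To make the ranking step concrete, a minimal sketch follows; the index values are invented for illustration, and the metric is a simple mean relative error across indices rather than the paper's exact formulation.

```python
# Sketch: ranking RCMs by their error in representing precipitation indices
# relative to one observational data set. All values are illustrative.
import numpy as np

indices_obs = {"mean": 2.1, "p95": 14.0, "max_1d": 48.0}   # one obs data set
indices_rcm = {
    "RCM-A": {"mean": 2.3, "p95": 12.5, "max_1d": 40.0},
    "RCM-B": {"mean": 2.0, "p95": 14.8, "max_1d": 52.0},
}

def score(model):
    # Mean relative error across the indices; lower is better.
    errs = [abs(indices_rcm[model][k] - v) / v for k, v in indices_obs.items()]
    return np.mean(errs)

ranking = sorted(indices_rcm, key=score)
print("Ranking (best first):", ranking)
```

Repeating this with a different observational data set (or a different index set or metric) can reorder the ranking, which is the paper's central point.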


1988, Vol 110 (2), pp. 172-179
Author(s): H. El-Tahan, S. Venkatesh, M. El-Tahan

This paper describes the evaluation of a model for predicting the drift of iceberg ensembles. The model was developed in preparation for providing an iceberg forecasting service off the Canadian east coast north of about 45°N. It was envisaged that 1-5 day forecasts of iceberg ensemble drift would be available. Following a critical examination of all available data, 10 data sets containing up to 404 icebergs in the Grand Banks area off Newfoundland were selected for detailed study. The winds measured in the vicinity of the study area, as well as the detailed current system developed by the International Ice Patrol, were used as inputs to the model. A discussion of the accuracy and limitations of the input data is presented. Qualitative and quantitative criteria were used to evaluate model performance. Applying these criteria to the results of the computer simulations, it is shown that the model provides good predictions. The degree of predictive success varied from one data set to another. The study demonstrated the validity of the assumption of random positioning for icebergs within a grid block, especially for ensembles with large numbers of icebergs. It was found that an "average" iceberg size can be used to represent all icebergs. The study also showed that, in order to achieve improved results, it will be necessary to account for iceberg deterioration (complete melting), especially during the summer months.


F1000Research, 2021, Vol 10, pp. 1264
Author(s): Nisha Kumari Devaraj, Ameer Al Mubarak Hamzah

Background: Since adsorption is a complex process, numerous models and theories have been devised to gain a general understanding of its underlying mechanisms. The interaction between adsorbates and adsorbents can be identified by modelling the adsorption data with different adsorption isotherms as well as kinetic models. Many studies also focus on developing predictive modelling techniques to facilitate accurate prediction of future adsorption trends.
Methods: In this study, a predictive model was developed based on a multiple linear regression technique, using existing data of As(V) adsorption onto several coated and uncoated magnetite samples. To understand the mechanisms and interactions involved, the data were first modelled using either Temkin or Freundlich linear isotherms. The predicted value is a single data point extension from the training data set. Subsequently, the predicted outcomes and the experimental values were compared using multiple error functions to assess the predictive model's performance.
Results: Certain predicted values were also compared to those obtained from the literature, and the results were found to have low error margins.
Conclusion: To further gauge the effectiveness of the proposed model in accurately predicting future adsorption trends, it should be tested on further adsorbent and adsorbate combinations.
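
As an illustration of the isotherm step, here is a sketch of a linearised Freundlich fit (ln qe = ln KF + (1/n) ln Ce) followed by a one-point-ahead prediction and a simple relative-error check; the concentration values are invented, not the study's As(V) data.

```python
# Sketch: linearised Freundlich isotherm fit and one-step-ahead prediction.
import numpy as np

Ce = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # equilibrium concentration (mg/L)
qe = np.array([1.2, 1.9, 3.1, 4.8, 7.6])   # adsorbed amount (mg/g)

# ln(qe) = ln(KF) + (1/n) * ln(Ce): a straight line in log-log space.
slope, intercept = np.polyfit(np.log(Ce), np.log(qe), 1)
KF, n_inv = np.exp(intercept), slope
print(f"Freundlich KF = {KF:.3f}, 1/n = {n_inv:.3f}")

# Predict one data point beyond the training range and compare against a
# hypothetical measured value with a relative-error function.
Ce_next, qe_measured = 16.0, 11.9
qe_pred = KF * Ce_next ** n_inv
print("Relative error:", abs(qe_pred - qe_measured) / qe_measured)
```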


2009, Vol 20 (1), pp. 25-34
Author(s): Daniel Ciolkosz

A methodology is presented for the correction and filling of solar radiation data at sites within South Africa, with the aim of creating a continuous, hourly-timestep dataset for multiple locations. Data from twenty sites, collected by the Agricultural Research Council, are analysed with regard to the amount of data requiring offset or multiplier adjustment, as well as the amount of bad data. A range correction algorithm is implemented based on the 90th percentile (10% exceedance) hourly irradiance, as a function of site latitude and elevation. The resulting corrected data set is titled the South African Solar Radiation Database (SASRAD). Comparisons are made with two other solar radiation datasets: the South African Atlas of Agrohydrology and Climatology, and a limited set of older historical data from the South African Weather Service (SAWS). Results indicate that the SASRAD dataset matches well with other datasets, with major discrepancies apparently due to problems with the other data sets rather than the SASRAD data. The coefficient of multiple determination (R²) between the Atlas and SASRAD for monthly radiation is 0.927, and the mean error between three of the SASRAD sites and the corresponding SAWS data is 1.1 MJ m⁻² d⁻¹. The fraction of data requiring correction varied from 11% to 100%, depending on the site. The range correction algorithm was successful at correcting data that had been subject to incorrect calibration, and did not remove annual trends in mean radiation levels.
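
In the spirit of the range correction described above, a sketch follows; the expected 90th-percentile irradiance is a made-up target here, whereas the paper derives it as a function of site latitude and elevation.

```python
# Sketch: multiplier-based range correction anchored to the 90th percentile
# (10% exceedance) of hourly irradiance.
import numpy as np

def range_correct(irradiance, expected_p90):
    """Rescale so the observed 90th percentile matches the expected value."""
    observed_p90 = np.percentile(irradiance, 90)
    multiplier = expected_p90 / observed_p90
    return irradiance * multiplier, multiplier

rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 150.0, size=8760)     # synthetic hourly data (W/m^2)
corrected, m = range_correct(obs, expected_p90=620.0)
print(f"Applied multiplier: {m:.3f}")
```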


2020, Vol 2020, pp. 1-12
Author(s): Zhenyu Lu, Cheng Zheng, Tingya Yang

Visibility forecasting in offshore areas faces the problems of scarce observational data and complex weather. This paper proposes an intelligent prediction method for offshore visibility based on a temporal convolutional network (TCN) and transfer learning to address these problems. First, the visibility data sets of the source and target domains are preprocessed to improve data quality. Then, a model based on a temporal convolutional network and transfer learning (TCN_TL) is built to learn the visibility data of the source domain. Finally, after transferring the knowledge learned from a large amount of data in the source domain, the model learns the small data set in the target domain. After training was completed, model data of the European Centre for Medium-Range Weather Forecasts (ECMWF) meteorological fields were selected to test model performance. The method proposed in this paper achieved relatively good results in the visibility forecast for the Qiongzhou Strait. Taking Haikou Station in the spring and winter of 2018 as an example, the forecast error is significantly lower than that before transfer learning, and the forecast score for the 0-1 km visibility level over the 24 h forecast period increases by 0.11. Compared with the CUACE forecast results, the forecast error of TCN_TL is smaller, and the TS score improves by 0.16. The results show that, for small data sets, transfer learning improves the prediction performance of the model, and TCN_TL performs better than other deep learning methods and CUACE.
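
A sketch of the TCN_TL idea in PyTorch: pretrain a small causal temporal convolutional network on abundant source-domain series, then freeze the convolutional stack and fine-tune only the head on the small target-domain set. The architecture, shapes and training loop are assumptions for illustration, not the authors' configuration.

```python
# Sketch: pretrain-then-fine-tune transfer learning with a tiny causal TCN.
import torch
import torch.nn as nn

class TinyTCN(nn.Module):
    def __init__(self, channels=16, kernel=3):
        super().__init__()
        # No padding: each output step sees only current and past inputs,
        # so the stack is causal by construction.
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel, dilation=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, dilation=2), nn.ReLU(),
        )
        self.head = nn.Linear(channels, 1)

    def forward(self, x):             # x: (batch, 1, time)
        h = self.net(x)               # (batch, channels, time - 6)
        return self.head(h[..., -1])  # next-step visibility from last state

model = TinyTCN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# 1) Pretrain on abundant source-domain series (random stand-in data here).
src_x, src_y = torch.randn(256, 1, 48), torch.randn(256, 1)
for _ in range(5):
    opt.zero_grad()
    loss_fn(model(src_x), src_y).backward()
    opt.step()

# 2) Transfer: freeze the convolutional stack, fine-tune only the head on
#    the small target-domain set.
for p in model.net.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(model.head.parameters(), lr=1e-4)
tgt_x, tgt_y = torch.randn(32, 1, 48), torch.randn(32, 1)
for _ in range(20):
    opt.zero_grad()
    loss_fn(model(tgt_x), tgt_y).backward()
    opt.step()
```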


Author(s): Meenakshi Srivastava

IoT-based communication between medical devices has encouraged the healthcare industry to use automated systems which provide effective insight from the massive amount of gathered data. AI and machine learning have played a major role in the design of such systems. Accuracy and validation must be considered carefully, since copious training data are required by a neural network (NN)-based deep learning model. This is hardly feasible in medical research, because the size of data sets is constrained by the complexity and high cost of experiments. Validation of a NN with only limited sample data therefore remains a concern. Predictions from a NN trained on a small data set come with no performance guarantee and can exhibit unstable behaviour. Surrogate-data-based validation of the NN can be viewed as a solution. In the current chapter, the classification of breast tissue data by a NN model is detailed. In the absence of a large data set, a surrogate-data-based validation approach has been applied. The discussed study can be applied to predictive modelling for applications described by small data sets.
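
One common form of surrogate-data validation is a label-permutation test: the model's cross-validated score on the true labels is compared against the score distribution over permuted-label surrogates. A sketch using scikit-learn follows, with the breast cancer Wisconsin data standing in for the chapter's breast tissue data.

```python
# Sketch: surrogate-data (label-permutation) validation of a small NN.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import permutation_test_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                                 random_state=0))

# Score on true labels vs. scores on 50 permuted-label surrogate data sets;
# a small p-value indicates the NN learned more than chance structure.
score, perm_scores, pvalue = permutation_test_score(
    nn, X, y, cv=5, n_permutations=50, random_state=0)
print(f"True-label score: {score:.3f}, surrogate p-value: {pvalue:.3f}")
```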

