An empirical investigation of alternative semi-supervised segmentation methodologies

2019 ◽  
Vol 115 (3/4) ◽  
Author(s):  
Douw G. Breed ◽  
Tanja Verster

Segmentation of data for the purpose of enhancing predictive modelling is a well-established practice in the banking industry. Unsupervised and supervised approaches are the two main types of segmentation, and examples of improved performance of predictive models exist for both approaches. However, each focuses on a single aspect, either target separation or independent variable distribution, and combining them may deliver better results. This combined approach is called semi-supervised segmentation. Our objective was to explore four new semi-supervised segmentation techniques that may offer alternative strengths. We applied these techniques to six data sets from different domains and compared the model performance achieved. The original semi-supervised segmentation technique was the best for two of the data sets (as measured by the improvement in validation set Gini), but the new techniques outperformed it on the other four. Significance: We propose four newly developed semi-supervised segmentation techniques that can be used as additional tools for segmenting data before fitting a logistic regression. In all comparisons, using semi-supervised segmentation before fitting a logistic regression improved the modelling performance (as measured by the Gini coefficient on the validation data set) compared to an unsegmented logistic regression.
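The Gini coefficient used here as the performance measure is a linear rescaling of the ROC AUC (Gini = 2 * AUC - 1). A minimal sketch of how it can be computed on a validation set, using the rank-based (Mann-Whitney) form of the AUC; names and data are illustrative:

```python
def gini(y_true, y_score):
    """Gini coefficient = 2*AUC - 1, with AUC computed via the
    Mann-Whitney pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("validation set needs both classes")
    # concordant pairs: positive scored above negative
    conc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = conc / (len(pos) * len(neg))
    return 2 * auc - 1
```

In the papers' setting one would fit a logistic regression per segment, score the validation set, and compare this Gini for segmented versus unsegmented models.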

2017 ◽  
Vol 113 (9/10) ◽  
Author(s):  
Douw G. Breed ◽  
Tanja Verster

We applied different modelling techniques to six data sets from different disciplines in industry, on which predictive models can be developed, to demonstrate the benefit of segmentation in linear predictive modelling. We compared the model performance achieved on the data sets to the performance of popular non-linear modelling techniques, by first segmenting the data (using unsupervised, semi-supervised, as well as supervised methods) and then fitting a linear modelling technique. A total of eight modelling techniques were compared. We show that no single modelling technique always outperforms the others on these data sets. Specifically considering the direct marketing data set from a local South African bank, it is observed that gradient boosting performed the best. Depending on the characteristics of the data set, one technique may outperform another. We also show that segmenting the data benefits the performance of the linear modelling technique in the predictive modelling context on all data sets considered. Specifically, of the three segmentation methods considered, semi-supervised segmentation appears the most promising.


2019 ◽  
Vol 7 (1) ◽  
Author(s):  
Warren Mukelabai Simangolwa

Appropriate open defecation free (ODF) sustainability interventions are key to further mobilise communities to consume sanitation and hygiene products and services that enhance households' quality of life and embed household behavioural change for healthier communities. This study aims to develop a logistic regression derived risk algorithm to estimate 12-month ODF slippage risk and externally validate the model in an independent data set. ODF slippage occurs when one or more toilet adequacy parameters are no longer present for one or more toilets in a community. Data in the Zambia district health information software for water, sanitation and hygiene management information system for the Chungu and Chabula chiefdoms was used for the study. The data was retrieved from the date of the Chungu and Chabula chiefdoms' attainment of ODF status in October 2016 for 12 months until September 2017, for the development and validation data sets respectively. Data was assumed to be missing completely at random and the complete case analysis approach was used. The events per variable were satisfactory for both the development and validation data sets. Multivariable regression with a backwards selection procedure was used to select candidate predictor variables, with p < 0.05 meriting inclusion. To correct for optimism, the study estimated the amount of heuristic shrinkage by comparing the model's apparent C-statistic to the C-statistic computed by nonparametric bootstrap resampling. In the resulting model, increases in the covariates 'months after ODF attainment', 'village population' and 'latrine built after CLTS' were all associated with a higher probability of ODF slippage. Conversely, an increase in the covariate 'presence of a handwashing station with soap' was associated with a reduced probability of ODF slippage. The predictive performance of the model was adjusted by a heuristic shrinkage factor of 0.988.
The external validation confirmed good prediction performance with an area under the receiver operating characteristic curve of 0.85 and no significant lack of fit (Hosmer-Lemeshow test: p = 0.246). The results must be interpreted with caution in regions where the ODF definitions, culture and other factors are different from those asserted in the study.
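The optimism-correction step described above (comparing the apparent C-statistic to bootstrap-resampled estimates) can be sketched as follows. `fit` and `predict` are placeholders for the actual multivariable logistic model, which is not reproduced here; the identity model in the test below is purely illustrative:

```python
import random

def c_statistic(y, scores):
    """C-statistic (equivalent to ROC AUC) via pairwise comparison."""
    pos = [s for yy, s in zip(y, scores) if yy == 1]
    neg = [s for yy, s in zip(y, scores) if yy == 0]
    conc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return conc / (len(pos) * len(neg))

def optimism_corrected(x, y, fit, predict, n_boot=200, seed=1):
    """Harrell-style bootstrap optimism correction: refit on each
    bootstrap sample, measure the drop when scoring the original data."""
    rng = random.Random(seed)
    apparent = c_statistic(y, predict(fit(x, y), x))
    optimism, used = 0.0, 0
    for _ in range(n_boot):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        bx, by = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(by)) < 2:
            continue  # resample drew a single class; skip it
        m = fit(bx, by)
        optimism += c_statistic(by, predict(m, bx)) - c_statistic(y, predict(m, x))
        used += 1
    return apparent, apparent - optimism / used
```

The corrected C-statistic (apparent minus mean optimism) is what the shrinkage factor in the abstract summarises.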


Circulation ◽  
2016 ◽  
Vol 133 (suppl_1) ◽  
Author(s):  
Nina P Paynter ◽  
Raji Balasubramanian ◽  
Shuba Gopal ◽  
Franco Giulianini ◽  
Leslie Tinker ◽  
...  

Background: Prior studies of metabolomic profiles and coronary heart disease (CHD) have been limited by relatively small case numbers and scant data in women. Methods: The discovery set examined 371 metabolites in 400 confirmed, incident CHD cases and 400 controls (frequency matched on age, race/ethnicity, hysterectomy status and time of enrollment) in the Women's Health Initiative Observational Study (WHI-OS). All selected metabolites were validated in a separate set of 394 cases and 397 matched controls drawn from the placebo arms of the WHI Hormone Therapy trials and the WHI-OS. Discovery used 4 methods: false-discovery rate (FDR) adjusted logistic regression for individual metabolites, permutation-corrected least absolute shrinkage and selection operator (LASSO) algorithms, sparse partial least squares discriminant analysis (PLS-DA) algorithms, and random forest algorithms. Each method was performed with matching factors only and with matching plus both medication use (aspirin, statins, anti-diabetics and anti-hypertensives) and traditional CHD risk factors (smoking, systolic blood pressure, diabetes, total and HDL cholesterol). Replication in the validation set was defined as a logistic regression coefficient of p<0.05 for the metabolites selected by 3 or 4 methods (tier 1), or an FDR adjusted p<0.05 for metabolites selected by only 1 or 2 methods (tier 2). Results: Sixty-seven metabolites were selected in the discovery data set (30 tier 1 and 37 tier 2). Twenty-six successfully replicated in the validation data set (21 tier 1 and 5 tier 2); 25 were significant when adjusting for matching factors only and 11 remained significant after additionally adjusting for medications and CHD risk factors. Validated metabolites included amino acids, sugars, nucleosides, eicosanoids, plasmalogens, polyunsaturated phospholipids and highly saturated triglycerides. These include novel metabolites as well as metabolites such as glutamate/glutamine, which have been reported in other populations. Conclusions: Multiple metabolites in important physiological pathways with robust associations with risk of CHD in women were identified and replicated. These results may offer insights into biological mechanisms of CHD as well as identify potential markers of risk.
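The FDR adjustment used for the tier-2 replication criterion is typically the Benjamini-Hochberg step-up procedure. A minimal sketch of that adjustment (illustrative, not the authors' exact pipeline):

```python
def fdr_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure).
    Returns adjusted p-values in the original input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adj[i] = prev
    return adj
```

A metabolite would then replicate at tier 2 if its adjusted p-value falls below 0.05.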


BMJ Open ◽  
2020 ◽  
Vol 10 (7) ◽  
pp. e037161
Author(s):  
Hyunmin Ahn

Objectives We investigated the usefulness of machine learning artificial intelligence (AI) in classifying the severity of ophthalmic emergencies for timely hospital visits. Study design This retrospective study analysed the patients who first visited the Armed Forces Daegu Hospital between May and December 2019. General patient information, events and symptoms were input variables. Events, symptoms, diagnoses and treatments were output variables. The output variables were classified into four classes (red, orange, yellow and green, indicating immediate to no emergency cases). About 200 cases of the class-balanced validation data set were randomly selected before all training procedures. An ensemble AI model using combinations of fully connected neural networks with the synthetic minority oversampling technique (SMOTE) algorithm was adopted. Participants A total of 1681 patients were included. Major outcomes Model performance was evaluated using accuracy, precision, recall and F1 scores. Results The accuracy of the model was 99.05%. The precision of each class (red, orange, yellow and green) was 100%, 98.10%, 92.73% and 100%, respectively. The recalls of each class were 100%, 100%, 98.08% and 95.33%. The F1 scores of each class were 100%, 99.04%, 95.33% and 96.00%. Conclusions We provided support for an AI method to classify ophthalmic emergency severity based on symptoms.
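The per-class precision, recall and F1 scores reported above can be computed directly from confusion counts. A small sketch, using the four triage classes as an example (the labels and predictions below are invented for illustration):

```python
def per_class_scores(y_true, y_pred, classes):
    """Precision, recall and F1 for each class in a multiclass problem,
    computed one-vs-rest from true/false positive and false negative counts."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores
```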


2019 ◽  
Vol 7 (3) ◽  
pp. SE113-SE122 ◽  
Author(s):  
Yunzhi Shi ◽  
Xinming Wu ◽  
Sergey Fomel

Salt boundary interpretation is important for the understanding of salt tectonics and for velocity model building for seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from a large-scale 3D training data set, applies data augmentation, and then feeds a large number of subvolumes into the network, using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth. We test the model on validation data sets and compare the blind test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with little or no need for manual input. We further test the model on a field example to demonstrate the generalization of this deep CNN method across different data sets.
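The label-generation and subvolume-extraction steps of such a data generator can be sketched in simplified form. The 4.45 km/s threshold and the crop logic are illustrative assumptions, not the authors' exact parameters:

```python
import random

def salt_labels(velocity, threshold=4.45):
    """Binary salt/nonsalt ground truth by thresholding a 3D velocity
    model; 4.45 km/s is an illustrative cutoff (salt is roughly 4.5 km/s)."""
    return [[[1 if v >= threshold else 0 for v in row]
             for row in plane] for plane in velocity]

def random_subvolume(volume, size, rng=None):
    """Randomly positioned (size x size x size) crop from a 3D volume,
    of the kind a training data generator would feed to the network."""
    rng = rng or random.Random(0)
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    z = rng.randrange(nz - size + 1)
    y = rng.randrange(ny - size + 1)
    x = rng.randrange(nx - size + 1)
    return [[plane[j][x:x + size] for j in range(y, y + size)]
            for plane in volume[z:z + size]]
```

In practice the volumes would be NumPy arrays and the crops would be batched, augmented, and paired with the thresholded labels before being fed to the encoder-decoder CNN.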


2018 ◽  
Vol 41 (1) ◽  
pp. 96-112 ◽  
Author(s):  
Evy Rombaut ◽  
Marie-Anne Guerry

Purpose This paper aims to question whether the available data in the human resources (HR) system could result in reliable turnover predictions without supplementary survey information. Design/methodology/approach A decision tree approach and a logistic regression model for analysing turnover were introduced. The methodology is illustrated on a real-life data set of a Belgian branch of a private company. The model performance is evaluated by the area under the ROC curve (AUC) measure. Findings It was concluded that data in the personnel system indeed lead to valuable predictions of turnover. Practical implications The presented approach brings determinants of voluntary turnover to the surface. The results yield useful information for HR departments. Where the logistic regression results in a turnover probability at the individual level, the decision tree makes it possible to ascertain employee groups that are at risk of turnover. With the data-set-based approach, each company can immediately ascertain its own turnover risk. Originality/value A data-driven approach to turnover investigation of this kind has not previously been studied.


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 1169-1169
Author(s):  
Roni Shouval ◽  
Annalisa Ruggeri ◽  
Myriam Labopin ◽  
Mohamad Mohty ◽  
Guillermo Sanz ◽  
...  

Abstract Background: Prognostic scoring systems for allogeneic stem cell transplantation (HSCT) are of clinical value when determining a leukemic patient's suitability for this curative, but risky, procedure. Several such scores have been developed over the years for HSCT from sibling or unrelated donors, but no predictive score has been developed specifically for umbilical cord blood transplantation (UCBT). Although individual parameters have been identified to be associated with UCBT outcomes in acute leukemia (AL) patients, integrative tools for risk evaluation in this setting are lacking. We sought to develop a prediction model for overall survival (OS) (primary objective) and leukemia free survival (LFS) (secondary objective) at 2 years following UCBT in acute leukemia patients. Methods: A retrospective, international, registry-based study of 3140 acute leukemia patients who underwent UCBT from 2004 through 2014. Inclusion criteria were patients with AL receiving single or double cord blood unit transplantation. Median follow-up was 30 months. The data set was geographically split into a derivation (65%) and validation set (35%). A Random Survival Forest was utilized to identify predictive factors. Top predictors were introduced into a Cox regression model, and a risk score was constructed according to each variable's hazard. Results: The median age at UCBT was 21.9 years. The 2-year OS rate was 47.7% (95% CI: 45.8-49.6). After identifying the top predictive variables, the UCBT risk score (Table 1) was constructed using 9 variables (disease status, diagnosis, cryopreserved cell dose, age, center experience, recipient cytomegalovirus sero-status, degree of HLA mismatch, previous autograft and anti-thymocyte globulin administration).
Over the derivation and validation data sets, a higher score was associated with decreasing probabilities of 2-year OS and LFS, ranging over the validation set from 0.72 (95% CI 0.64-0.8) and 0.68 (95% CI 0.6-0.76) to 0.13 (95% CI 0.06-0.27) and 0.14 (95% CI 0.07-0.28), respectively (Figure 1). An increasing score was also associated with an increasing hazard for the predicted outcomes (Table 2). The score's discrimination (AUC) over the validation set for 2-year OS and LFS was 68.26 (95% CI 64.25-72.27) and 66.95 (95% CI 62.88-71.02), respectively. Calibration was excellent. Conclusion: We have developed the first integrative score for prediction of overall survival and leukemia free survival in acute leukemia patients undergoing UCBT. The score is simple and stratifies patients into distinct risk groups. Table 1. The UCBT Risk Score. Table 2. Association between the UCBT risk score and 2-year OS and LFS over the validation data set. Figure 1. Overall survival stratified by the UCBT risk score over the validation data set. Disclosures Bader: Servier: Consultancy, Honoraria; Neovii Biotech: Research Funding; Riemser: Research Funding; Medac: Consultancy, Research Funding; Novartis: Consultancy, Honoraria.
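A risk score of this kind is typically built by assigning each predictor points proportional to its Cox log-hazard coefficient, summing the points, and cutting the total into risk strata. A hypothetical sketch; the variable names, weights and cutoffs below are invented for illustration and are not the published UCBT score:

```python
def risk_score(patient, weights):
    """Additive risk score: each predictor contributes points proportional
    to its Cox log-hazard coefficient (weights here are hypothetical)."""
    return sum(w * patient.get(k, 0) for k, w in weights.items())

def risk_group(score, cutoffs=(2, 4, 6)):
    """Stratify a total score into distinct risk groups (illustrative cutoffs)."""
    for group, cut in enumerate(cutoffs):
        if score < cut:
            return group
    return len(cutoffs)
```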


2020 ◽  
Vol 13 (2) ◽  
pp. 373-404 ◽  
Author(s):  
Andrew M. Sayer ◽  
Yves Govaerts ◽  
Pekka Kolmonen ◽  
Antti Lipponen ◽  
Marta Luffarelli ◽  
...  

Abstract. Recent years have seen the increasing inclusion of per-retrieval prognostic (predictive) uncertainty estimates within satellite aerosol optical depth (AOD) data sets, providing users with quantitative tools to assist in the optimal use of these data. Prognostic estimates contrast with diagnostic (i.e. relative to some external truth) ones, which are typically obtained using sensitivity and/or validation analyses. Up to now, however, the quality of these uncertainty estimates has not been routinely assessed. This study presents a review of existing prognostic and diagnostic approaches for quantifying uncertainty in satellite AOD retrievals, and it presents a general framework to evaluate them based on the expected statistical properties of ensembles of estimated uncertainties and actual retrieval errors. It is hoped that this framework will be adopted as a complement to existing AOD validation exercises; it is not restricted to AOD and can in principle be applied to other quantities for which a reference validation data set is available. This framework is then applied to assess the uncertainties provided by several satellite data sets (seven over land, five over water), whose methods range from empirical approaches to sensitivity analyses to formal error propagation, at 12 Aerosol Robotic Network (AERONET) sites. The AERONET sites are divided into those for which it is expected that the techniques will perform well and those for which some complexity about the site may provide a more severe test. Overall, all techniques show some skill in that larger estimated uncertainties are generally associated with larger observed errors, although they are sometimes poorly calibrated (i.e. too small or too large in magnitude). No technique uniformly performs best. For powerful formal uncertainty propagation approaches such as optimal estimation, the results illustrate some of the difficulties in appropriate population of the covariance matrices required by the technique.
When the data sets are confronted by a situation strongly counter to the retrieval forward model (e.g. potentially mixed land–water surfaces or aerosol optical properties outside the family of assumptions), some algorithms fail to provide a retrieval, while others do but with a quantitatively unreliable uncertainty estimate. The discussion suggests paths forward for the refinement of these techniques.
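The expected statistical property at the core of such a framework is that normalized errors (actual error divided by estimated 1-sigma uncertainty) should behave like a standard Gaussian when the uncertainties are well calibrated. A minimal sketch of two such diagnostics; the function name and thresholds are illustrative:

```python
def uncertainty_skill(errors, sigmas):
    """Two simple calibration diagnostics for prognostic uncertainties:
    - coverage: fraction of |error| within the estimated 1-sigma
      (about 0.683 for Gaussian errors with well-calibrated sigmas)
    - chi2: mean squared normalized error (about 1 if calibrated;
      below 1 means sigmas are too large, above 1 too small)"""
    norm = [e / s for e, s in zip(errors, sigmas)]
    coverage = sum(abs(z) <= 1 for z in norm) / len(norm)
    chi2 = sum(z * z for z in norm) / len(norm)
    return coverage, chi2
```

Applied per site or per data set against AERONET reference values, these statistics reveal whether a product's estimated uncertainties are too small or too large in magnitude.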


Genes ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 778 ◽  
Author(s):  
Liu ◽  
Liu ◽  
Pan ◽  
Li ◽  
Yang ◽  
...  

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers that diagnose diverse cancer types simultaneously, i.e. pan-cancers. In this study, we sought to identify DNA methylation markers that differentiate cancer samples from the respective normal samples in pan-cancers. We collected whole genome methylation data of 27 cancer types containing 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets, including one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers, and specifically, we constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located at cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. For the promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers 92%. For both marker types, the specificity of normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.


2020 ◽  
Author(s):  
Yong Li ◽  
Shuzheng Lyu

BACKGROUND Prevention of coronary microvascular obstruction/no-reflow phenomenon (CMVO/NR) is a crucial step in improving the prognosis of patients with acute ST segment elevation myocardial infarction (STEMI) during primary percutaneous coronary intervention (PPCI). OBJECTIVE The objective of our study was to develop and externally validate a diagnostic model of CMVO/NR in patients with acute STEMI who underwent PPCI. METHODS Design: Multivariate logistic regression of a cohort of acute STEMI patients. Setting: Emergency department ward of a university hospital. Participants: Diagnostic model development: a total of 1232 acute STEMI patients who were consecutively treated with PPCI from November 2007 to December 2013. External validation: a total of 1301 acute STEMI patients who were treated with PPCI from January 2014 to June 2018. Outcomes: CMVO/NR during PPCI. We used logistic regression analysis to analyse the risk factors of CMVO/NR in the development data set. We developed a diagnostic model of CMVO/NR and constructed a nomogram. We assessed the predictive performance of the diagnostic model in the validation data set by examining measures of discrimination, calibration, and decision curve analysis (DCA). RESULTS A total of 147 out of 1232 participants (11.9%) presented CMVO/NR in the development data set. The strongest predictors of CMVO/NR were age, periprocedural bradycardia, use of thrombus aspiration devices during the procedure, and total occlusion of the culprit vessel. Logistic regression analysis showed significant differences between the groups with and without CMVO/NR in age (odds ratio (OR) 1.031; 95% confidence interval (CI) 1.015-1.048; P < .001), periprocedural bradycardia (OR 2.151; 95% CI 1.472-3.143; P < .001), total occlusion of the culprit vessel (OR 1.842; 95% CI 1.095-3.1; P = .021), and use of thrombus aspiration devices during the procedure (OR 1.631; 95% CI 1.029-2.584; P = .037). We developed a diagnostic model of CMVO/NR.
The area under the receiver operating characteristic curve (AUC) was 0.6833 ± 0.023. We constructed a nomogram. CMVO/NR occurred in 120 out of 1301 participants (9.2%) in the validation data set. The AUC was 0.6547 ± 0.025. Discrimination, calibration, and DCA were satisfactory. Date of approval by ethics committee: 16 May 2019. Date of data collection start: 1 June 2019. Numbers recruited as of submission of the manuscript: 2533. CONCLUSIONS We developed and externally validated a diagnostic model of CMVO/NR during PPCI. CLINICALTRIAL We registered this study with the WHO International Clinical Trials Registry Platform on 16 May 2019. Registration number: ChiCTR1900023213. http://www.chictr.org.cn/edit.aspx?pid=39057&htm=4.
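Given the reported odds ratios, a nomogram-style prediction reduces to a logistic model whose coefficients are the log odds ratios. A sketch with a hypothetical intercept (the true intercept is not reported in the abstract); binary predictors are 0/1:

```python
import math

def predict_cmvo_nr(age, bradycardia, total_occlusion, aspiration,
                    intercept=-4.0):
    """Predicted probability of CMVO/NR from the reported odds ratios.
    The intercept is a hypothetical placeholder, so absolute probabilities
    are illustrative; relative orderings reflect the reported ORs."""
    logit = (intercept
             + math.log(1.031) * age            # OR 1.031 per year of age
             + math.log(2.151) * bradycardia    # OR 2.151
             + math.log(1.842) * total_occlusion  # OR 1.842
             + math.log(1.631) * aspiration)    # OR 1.631
    return 1 / (1 + math.exp(-logit))
```

The nomogram described in the abstract is essentially a graphical tabulation of this sum: each predictor maps to a points axis proportional to its coefficient.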

