Predicting Undesired Treatment Outcome in Mental Healthcare: Machine Learning Study (Preprint)

2019 ◽  
Author(s):  
Kasper Van Mens ◽  
Joran Lokkerbol ◽  
Richard Janssen ◽  
Robert de Lange ◽  
Bea Tiemens

BACKGROUND: It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE: In this study we compare machine learning algorithms that predict, during treatment, which patients will not benefit from brief mental health treatment, and we present the trade-offs that must be considered before such an algorithm can be used in clinical practice. METHODS: Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS: The performance of the three algorithms did not differ significantly on the test set. With a default classification cut-off at 0.5 predicted probability, the extreme gradient boosting algorithm showed the highest positive predictive value (PPV) of 0.71 (0.61–0.77), with a sensitivity of 0.35 (0.29–0.41) and an area under the curve of 0.78. A trade-off can be made between PPV and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the PPV increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off at 0.38, the PPV decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS: Machine learning can be used to predict treatment outcomes based on routine monitoring data. This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.
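
As an aside on how such a cut-off analysis can be reproduced: the sketch below (Python with scikit-learn and an assumed xgboost install; the data is synthetic, since the study's routine outcome monitoring records are not public) scores one fitted model at the three probability cut-offs reported above.

```python
# Minimal sketch of the PPV/sensitivity trade-off; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2655, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=795, random_state=0)

model = XGBClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

for cutoff in (0.38, 0.50, 0.63):      # cut-offs taken from the abstract
    pred = (proba >= cutoff).astype(int)
    ppv = precision_score(y_te, pred)  # PPV is precision
    sens = recall_score(y_te, pred)    # sensitivity is recall
    print(f"cut-off {cutoff:.2f}: PPV = {ppv:.2f}, sensitivity = {sens:.2f}")
```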

Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to malware detection, such as packet content analysis, are inefficient when dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and similar metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms: support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. The machine learning algorithms are trained on the training set, and the resulting models are evaluated against the testing set to assess their respective performances. We further tune the hyperparameters of the algorithms to achieve better results. The random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, yielding area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
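
A hedged sketch of the pipeline this abstract describes (feature selection, train/test split, training, evaluation). The dataset, feature counts, and the choice of mutual information for selection are placeholders, not the paper's actual setup:

```python
# Feature selection + three-classifier comparison on synthetic flow metadata.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=80,
                           n_informative=15, random_state=1)

# Dimensionality reduction: keep the k most informative metadata features.
X_sel = SelectKBest(mutual_info_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3,
                                          random_state=1)

models = {
    "SVM": SVC(probability=True),
    "Random forest": RandomForestClassifier(n_estimators=300, random_state=1),
    "XGBoost": XGBClassifier(n_estimators=300, random_state=1),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.4f}")
```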


2021 ◽  
pp. 1-29
Author(s):  
Fikrewold H. Bitew ◽  
Corey S. Sparks ◽  
Samuel H. Nyarko

Abstract Objective: Child undernutrition is a global public health problem with serious implications. In this study, we estimate predictive algorithms for the determinants of childhood stunting using various machine learning (ML) algorithms. Design: This study draws on data from the Ethiopian Demographic and Health Survey of 2016. Five machine learning algorithms, including eXtreme gradient boosting (xgbTree), k-nearest neighbors (K-NN), random forest (RF), neural network (NNet), and generalized linear models (GLM), were considered to predict the socio-demographic risk factors for undernutrition in Ethiopia. Setting: Households in Ethiopia. Participants: A total of 9,471 children below five years of age. Results: The descriptive results show substantial regional variations in child stunting, wasting, and underweight in Ethiopia. Among the five ML algorithms, the xgbTree algorithm shows better prediction ability than the generalized linear model. The best-performing algorithm (xgbTree) identifies diverse important predictors of undernutrition across the three outcomes, including time to water source, anemia history, child age greater than 30 months, small birth size, and maternal underweight, among others. Conclusions: The xgbTree algorithm was a reasonably superior ML algorithm for predicting childhood undernutrition in Ethiopia compared to the other ML algorithms considered in this study. The findings support improvements in access to water supply, food security, and fertility regulation, among others, in the quest to considerably improve childhood nutrition in Ethiopia.
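
The model codes (xgbTree, K-NN, RF, NNet, GLM) suggest R's caret package; a rough Python analogue comparing the same five model families under cross-validation might look like the following, with survey features simulated:

```python
# Five-way model comparison via cross-validated AUC; features are synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Placeholder for the survey predictors of the 9,471 children.
X, y = make_classification(n_samples=9471, n_features=25, random_state=2)

models = {
    "xgbTree ~ XGBClassifier": XGBClassifier(n_estimators=200, random_state=2),
    "K-NN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(n_estimators=300, random_state=2),
    "NNet": MLPClassifier(max_iter=500, random_state=2),
    "GLM ~ logistic": LogisticRegression(max_iter=1000),
}
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean CV AUC = {auc:.3f}")
```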


2020 ◽  
Vol 9 (9) ◽  
pp. 507
Author(s):  
Sanjiwana Arjasakusuma ◽  
Sandiaga Swahyu Kusuma ◽  
Stuart Phinn

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data of high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA), in combination with machine learning algorithms such as multivariate adaptive regression splines (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with tree (XGBtree and XGBdart) and linear (XGBlin) boosters, were evaluated. The results demonstrated that the combinations BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R2 = 0.53 and RMSE = 1.7 m (18.4% nRMSE and 0.046 m bias) for BO-XGBdart and R2 = 0.51 and RMSE = 1.8 m (15.8% nRMSE and −0.244 m bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection: it reduced the predictor set by about 95%, selecting the 29 most important of the initial 516 variables from the lidar metrics and hyperspectral data.
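
A sketch of the BO-SVR combination under synthetic data, assuming the Python boruta package (BorutaPy); the real inputs would be the 516 lidar and hyperspectral variables:

```python
# Boruta selects variables, then an RBF SVR is fit on the reduced set.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from boruta import BorutaPy  # assumed installed (pip install boruta)

X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=5.0, random_state=3)

rf = RandomForestRegressor(n_estimators=200, random_state=3)
selector = BorutaPy(rf, n_estimators="auto", random_state=3)
selector.fit(X, y)                      # expects numpy arrays
X_sel = X[:, selector.support_]         # keep confirmed variables only
print(f"Boruta kept {selector.support_.sum()} of {X.shape[1]} variables")

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3,
                                          random_state=3)
svr = SVR(kernel="rbf").fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, svr.predict(X_te)) ** 0.5
print(f"BO-SVR RMSE: {rmse:.2f}")
```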


2018 ◽  
Vol 12 (2) ◽  
pp. 85-98 ◽  
Author(s):  
Barry E King ◽  
Jennifer L Rice ◽  
Julie Vaughan

Research predicting National Hockey League average attendance is presented. The seasons examined run from the 2013 hockey season through the beginning of the 2017 hockey season. Multiple linear regression and three machine learning algorithms – random forest, M5 prime, and extreme gradient boosting – are employed to predict out-of-sample average home game attendance. Extreme gradient boosting generated the lowest out-of-sample root mean square error. The team identifier (team name), the number of Twitter followers (a surrogate for team popularity), median ticket price, and arena capacity emerged as the top four predictor variables.
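
An illustrative sketch of this kind of attendance regression: the categorical team identifier plus the other three top predictors, one-hot encoded and fed to an XGBoost regressor. All values below are fabricated for shape only:

```python
# Attendance regression on fabricated records with the four top predictors.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
n = 400
df = pd.DataFrame({
    "team": rng.choice([f"team_{i}" for i in range(31)], n),
    "twitter_followers": rng.integers(50_000, 2_000_000, n),
    "median_ticket_price": rng.uniform(40, 160, n),
    "arena_capacity": rng.integers(15_000, 22_000, n),
})
# Synthetic attendance target, loosely tied to capacity and popularity.
y = (0.9 * df["arena_capacity"] + 0.002 * df["twitter_followers"]
     + rng.normal(0, 500, n))

X = pd.get_dummies(df, columns=["team"], dtype=float)  # encode team identifier
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=4)
model = XGBRegressor(n_estimators=300, random_state=4).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"out-of-sample RMSE: {rmse:.0f}")
```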


2021 ◽  
Vol 8 ◽  
Author(s):  
Jiang Zhu ◽  
Jinxin Zheng ◽  
Longfei Li ◽  
Rui Huang ◽  
Haoyu Ren ◽  
...  

Purpose: There are no clear indications of whether central lymph node dissection is necessary in patients with T1-T2, non-invasive papillary thyroid carcinoma (PTC) and clinically uninvolved central neck lymph nodes. This study seeks to develop and validate models for predicting the risk of central lymph node metastasis (CLNM) in these patients based on machine learning algorithms. Methods: This is a retrospective study comprising 1,271 patients with T1-T2 stage, non-invasive, and clinically node-negative (cN0) PTC who underwent surgery at the Department of Endocrine and Breast Surgery of The First Affiliated Hospital of Chongqing Medical University from February 1, 2016, to December 31, 2018. We applied six machine learning (ML) algorithms, including Logistic Regression (LR), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Decision Tree (DT), and Neural Network (NNET), coupled with preoperative clinical characteristics and intraoperative information, to develop prediction models for CLNM. Among all the samples, 70% were randomly selected to train the models, while the remaining 30% were used for validation. Indices such as the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and accuracy were calculated to test the models' performance. Results: Approximately 51.3% (652 out of 1,271) of the patients had pN1 disease. In multivariate logistic regression analyses, gender, tumor size and location, multifocality, age, and Delphian lymph node status were all independent predictors of CLNM. In predicting CLNM, the six ML algorithms posted AUROCs of 0.70–0.75, with the XGBoost model standing out at 0.75. We therefore employed the best-performing ML model and uploaded the results to a self-made online risk calculator to estimate an individual's probability of CLNM (https://jin63.shinyapps.io/ML_CLNM/). Conclusions: With the incorporation of preoperative and intraoperative risk factors, ML algorithms can achieve acceptable prediction of CLNM, with the XGBoost model performing best. Our online risk calculator based on this ML algorithm may help determine the optimal extent of initial surgical treatment for patients with T1-T2 stage, non-invasive, and clinically node-negative PTC.
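
A compact sketch of the 70/30 protocol and the metrics reported above (AUROC, sensitivity, specificity, accuracy), with simulated stand-ins for the clinical features; the class balance roughly mirrors the ~51% pN1 rate:

```python
# 70/30 split, one classifier, and the four reported metrics on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1271, n_features=12,
                           weights=[0.49], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                          stratify=y, random_state=5)

model = XGBClassifier(n_estimators=200, random_state=5).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"AUROC:       {roc_auc_score(y_te, proba):.2f}")
print(f"sensitivity: {tp / (tp + fn):.2f}")
print(f"specificity: {tn / (tn + fp):.2f}")
print(f"accuracy:    {accuracy_score(y_te, pred):.2f}")
```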


2021 ◽  
Author(s):  
Hossein Sahour ◽  
Vahid Gholami ◽  
Javad Torkman ◽  
Mehdi Vazifedan ◽  
Sirwe Saeedi

Abstract Monitoring temporal variation of streamflow is necessary for many water resources management plans, yet such practices are constrained by the absence or paucity of data in many rivers around the world. Using a permanent river in the north of Iran as a test site, a machine learning framework was proposed to model streamflow data in the three periods of the growing season based on tree-ring and vessel features of the Zelkova carpinifolia species. First, full-disc samples were taken from 30 trees near the river, and the samples went through preprocessing, cross-dating, standardization, and time series analysis. Two machine learning algorithms, namely random forest (RF) and extreme gradient boosting (XGB), were used to model the relationships between the dendrochronology variables (tree-ring and vessel features in the three periods of the growing season) and the corresponding streamflow rates. The performance of each model was evaluated using statistical coefficients: the coefficient of determination (R-squared), Nash-Sutcliffe efficiency (NSE), and normalized root-mean-square error (NRMSE). Findings demonstrate that consideration should be given to the XGB model in streamflow modeling, given its enhanced performance (R-squared: 0.87; NSE: 0.81; and NRMSE: 0.43) over the RF model (R-squared: 0.82; NSE: 0.71; and NRMSE: 0.52). Further, the results showed that the models perform better in modeling normal and low flows than extremely high flows. Finally, the tested models were used to reconstruct temporal streamflow during past decades (1970–1981).
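
NSE and NRMSE are easy to mis-specify, so a small reference implementation may help; note that NRMSE normalization conventions vary (range, mean, or standard deviation of observations), and the range convention is assumed here:

```python
# Reference implementations of NSE and NRMSE on simulated streamflow pairs.
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect; 0 matches the observed mean."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def nrmse(obs, sim):
    """RMSE normalized by the range of observations (one common convention)."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    rmse = np.sqrt(np.mean((obs - sim) ** 2))
    return rmse / (obs.max() - obs.min())

rng = np.random.default_rng(6)
obs = rng.gamma(2.0, 5.0, 200)        # synthetic streamflow observations
sim = obs + rng.normal(0, 2.0, 200)   # synthetic model output with noise
print(f"NSE = {nse(obs, sim):.2f}, NRMSE = {nrmse(obs, sim):.2f}")
```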


Author(s):  
R. Madhuri ◽  
S. Sistla ◽  
K. Srinivasa Raju

Abstract Assessing floods and their likely impact in climate change scenarios will enable the facilitation of sustainable management strategies. In this study, five machine learning (ML) algorithms, namely (i) Logistic Regression, (ii) Support Vector Machine, (iii) K-nearest neighbor, (iv) Adaptive Boosting (AdaBoost) and (v) Extreme Gradient Boosting (XGBoost), were tested for the Greater Hyderabad Municipal Corporation (GHMC), India, to evaluate their ability to classify locations (flooded or non-flooded) under climate change scenarios. A geo-spatial database with eight flood-influencing factors, namely rainfall, elevation, slope, distance from the nearest stream, evapotranspiration, land surface temperature, normalised difference vegetation index, and curve number, was developed for 2000, 2006 and 2016. XGBoost performed the best, with the highest mean area under the curve (AUC) score of 0.83. Hence, XGBoost was adopted to simulate future flood locations corresponding to the probable highest rainfall events under four Representative Concentration Pathways (RCPs), namely 2.6, 4.5, 6.0 and 8.5, along with other flood-influencing factors, for 2040, 2056, 2050 and 2064, respectively. The resulting flood risk probabilities are predicted to range over 39–77%, 16–39%, 42–63% and 39–77% for the respective years.
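
A sketch of the classification-then-probability step: a classifier trained on the eight flood-influencing factors, whose predicted probabilities become per-location flood risk. Feature names follow the abstract; values and the toy label rule are simulated:

```python
# XGBoost flood classifier whose predict_proba serves as per-location risk.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(7)
n = 1000
factors = ["rainfall", "elevation", "slope", "dist_stream",
           "evapotranspiration", "lst", "ndvi", "curve_number"]
X = pd.DataFrame(rng.normal(size=(n, len(factors))), columns=factors)
# Toy flood label: wetter and lower locations flood more often.
y = (X["rainfall"] - X["elevation"] + rng.normal(0, 0.5, n) > 0).astype(int)

model = XGBClassifier(n_estimators=200, random_state=7).fit(X, y)
risk = model.predict_proba(X)[:, 1]   # flood risk probability per location
print(f"risk range: {risk.min():.0%} - {risk.max():.0%}")
```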


2019 ◽  
Author(s):  
Allan C. Just ◽  
Yang Liu ◽  
Meytar Sorek-Hamer ◽  
Johnathan Rush ◽  
Michael Dorman ◽  
...  

Abstract. The atmospheric products of the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm include column water vapor (CWV) at 1 km resolution, derived from daily overpasses of NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra satellites. We have recently shown that machine learning using extreme gradient boosting (XGBoost) can improve the estimation of MAIAC aerosol optical depth (AOD). Although MAIAC CWV is generally well validated (Pearson's R > 0.97 versus CWV from AERONET sun photometers), it has not yet been assessed whether machine-learning approaches can further improve CWV. Using a novel spatiotemporal cross-validation approach to avoid overfitting, our XGBoost model, with nine features derived from land use terms, date, and ancillary variables from the MAIAC retrieval, quantifies and can correct a substantial portion of measurement error relative to collocated measures at AERONET sites (26.9% and 16.5% decreases in root mean square error (RMSE) for the Terra and Aqua datasets, respectively) in the Northeastern USA, 2000–2015. We use machine-learning interpretation tools to illustrate complex patterns of measurement error and describe a positive bias in MAIAC Terra CWV that worsens in recent summertime conditions. We validate our predictive model on MAIAC CWV estimates at independent stations from the SuomiNet GPS network, where our corrections decrease the RMSE by 19.7% and 9.5% for Terra and Aqua MAIAC CWV, respectively. Empirically correcting for measurement error with machine-learning algorithms is a post-processing opportunity to improve satellite-derived CWV data for Earth science and remote sensing applications.
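
The paper's spatiotemporal cross-validation is bespoke, but a common approximation is to group folds by station so that no site contributes to both the training and test splits; a sketch with GroupKFold and synthetic station data:

```python
# Site-grouped cross-validation to avoid spatial leakage across folds.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(8)
n, n_sites = 2000, 40
X = rng.normal(size=(n, 9))               # nine features, as in the abstract
y = 0.8 * X[:, 0] + rng.normal(0, 0.3, n) # synthetic CWV measurement error
sites = rng.integers(0, n_sites, n)       # station ID for each observation

model = XGBRegressor(n_estimators=300, random_state=8)
scores = cross_val_score(model, X, y, groups=sites,
                         cv=GroupKFold(n_splits=5),
                         scoring="neg_root_mean_squared_error")
print(f"site-grouped CV RMSE: {-scores.mean():.3f}")
```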


2021 ◽  
Vol 12 (2) ◽  
pp. 28-55
Author(s):  
Fabiano Rodrigues ◽  
Francisco Aparecido Rodrigues ◽  
Thelma Valéria Rocha Rodrigues

This study analyzes results obtained with machine learning models for predicting startup success. As a proxy for success, the investor's perspective is adopted, in which acquisition of the startup or an IPO (Initial Public Offering) are ways of recovering the investment. The literature review covers startups and financing vehicles, previous studies on predicting startup success via machine learning models, and trade-offs between machine learning techniques. In the empirical part, a quantitative study was carried out based on secondary data from the American platform Crunchbase, covering startups from 171 countries. The research design filtered for startups founded between June 2010 and June 2015, with a prediction window between June 2015 and June 2020 for predicting startup success. After data preprocessing, the sample comprised 18,571 startups. Six binary classification models were used for prediction: Logistic Regression, Decision Tree, Random Forest, Extreme Gradient Boosting, Support Vector Machine, and Neural Network. In the end, the Random Forest and Extreme Gradient Boosting models showed the best performance in the classification task. This article, involving machine learning and startups, contributes to hybrid research areas by merging the fields of Management and Data Science. It also provides investors with a tool for the initial screening of startups when searching for targets with a higher probability of success.
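
A sketch of the label construction the study design implies: a startup founded in the 2010–2015 window counts as a success if it was acquired or completed an IPO by June 2020. Column names are hypothetical stand-ins for Crunchbase fields:

```python
# Building the binary success label from exit status and a prediction window.
import pandas as pd

df = pd.DataFrame({
    "name": ["alpha", "beta", "gamma"],
    "founded_on": pd.to_datetime(["2011-03-01", "2014-07-15", "2012-01-10"]),
    "status": ["acquired", "operating", "ipo"],
    "exit_date": pd.to_datetime(["2017-05-01", None, "2019-11-20"]),
})

window_end = pd.Timestamp("2020-06-30")
df["success"] = (
    df["status"].isin(["acquired", "ipo"])    # investor-perspective exits
    & (df["exit_date"] <= window_end)         # within the prediction window
).astype(int)
print(df[["name", "status", "success"]])
```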


2021 ◽  
Author(s):  
Mandana Modabbernia ◽  
Heather C Whalley ◽  
David Glahn ◽  
Paul M. Thompson ◽  
Rene S. Kahn ◽  
...  

Application of machine learning algorithms to structural magnetic resonance imaging (sMRI) data has yielded behaviorally meaningful estimates of the biological age of the brain (brain-age). The choice of machine learning approach for estimating brain-age in children and adolescents is important because age-related brain changes in these age groups are dynamic. However, the comparative performance of the many available machine learning algorithms has not been systematically appraised. To address this gap, the present study evaluated the accuracy (Mean Absolute Error; MAE) and computational efficiency of 21 machine learning algorithms using sMRI data from 2,105 typically developing individuals aged 5 to 22 years from five cohorts. The trained models were then tested on an independent holdout dataset comprising 4,078 pre-adolescents (aged 9-10 years). The algorithms encompassed parametric and nonparametric, Bayesian, linear and nonlinear, tree-based, and kernel-based models. Sensitivity analyses were performed for parcellation scheme, number of neuroimaging input features, number of cross-validation folds, and sample size. The best-performing algorithms were Extreme Gradient Boosting (MAE of 1.25 years for females and 1.57 years for males), Random Forest Regression (MAE of 1.23 years for females and 1.65 years for males) and Support Vector Regression with Radial Basis Function Kernel (MAE of 1.47 years for females and 1.72 years for males), which had acceptable and comparable computational efficiency. Findings of the present study could be used as a guide for optimizing methodology when quantifying age-related changes during development.
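
A sketch of the holdout evaluation described above: fit an age regressor on training-cohort sMRI features, then report MAE separately for females and males in the independent holdout. All arrays are simulated placeholders with the sample sizes from the abstract:

```python
# Train-on-cohorts, test-on-holdout brain-age evaluation with sex-split MAE.
import numpy as np
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

rng = np.random.default_rng(9)
n_train, n_test, n_feat = 2105, 4078, 150
X_tr = rng.normal(size=(n_train, n_feat))   # training sMRI features
age_tr = rng.uniform(5, 22, n_train)        # ages 5-22, as in the cohorts
X_te = rng.normal(size=(n_test, n_feat))    # independent holdout features
age_te = rng.uniform(9, 10, n_test)         # pre-adolescent holdout ages
sex_te = rng.integers(0, 2, n_test)         # 0 = female, 1 = male

model = XGBRegressor(n_estimators=300, random_state=9).fit(X_tr, age_tr)
pred = model.predict(X_te)
for code, label in [(0, "females"), (1, "males")]:
    mask = sex_te == code
    mae = mean_absolute_error(age_te[mask], pred[mask])
    print(f"MAE {label}: {mae:.2f} years")
```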

