Combination of ADASYN-N and Random Forest in Predicting of Obesity Status in Indonesia: A Case Study of Indonesian Basic Health Research 2013

2021, Vol 2123 (1), pp. 012039
Author(s): M Aqsha, SA Thamrin, Armin Lawi

Abstract: Obesity is a pathological condition caused by the accumulation of fat in excess of what the body needs to function, and its occurrence is associated with a set of identifiable risk factors. Machine learning approaches offer an alternative for predicting obesity status from these risk factors. However, the available datasets are often not balanced across their data classes, and such class imbalance can make the resulting predictions inaccurate. The purpose of this paper is to overcome class imbalance and predict obesity status using the 2013 Indonesian Basic Health Research (RISKESDAS) data. Adaptive Synthetic Nominal (ADASYN-N) is used to balance the obesity status data, which is then classified with one of the machine learning approaches, namely Random Forest. The results show that ADASYN-N with a balance-level parameter of 1 (β = 100%) for synthetic data generation, combined with a Random Forest of 200 trees built on 7 risk-factor variables, yields a good classification of obesity status, as reflected in an AUC value of 84.41%.
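
As an illustration only, the sketch below shows how a pipeline of this shape could be written in Python with imbalanced-learn and scikit-learn. It is not the authors' code: imbalanced-learn ships the standard ADASYN rather than the nominal ADASYN-N variant used in the paper, and the data, the encoding of the 7 risk factors, and the split sizes are synthetic assumptions.

```python
# Hedged sketch of ADASYN over-sampling + Random Forest (200 trees) with AUC evaluation.
# Standard ADASYN stands in for ADASYN-N; data are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN

# Synthetic stand-in for 7 encoded risk factors and an imbalanced obesity label.
X, y = make_classification(n_samples=5000, n_features=7, n_informative=5,
                           weights=[0.86, 0.14], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Balance-level 1 (beta = 100%) roughly corresponds to asking ADASYN to fully balance the classes.
X_bal, y_bal = ADASYN(sampling_strategy=1.0, random_state=0).fit_resample(X_train, y_train)

# Random Forest with 200 trees, as in the paper.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print("AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
```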

2021, Vol 8
Author(s): Sri Astuti Thamrin, Dian Sidik Arsyad, Hedi Kuswanto, Armin Lawi, Sudirman Nasir

Obesity is strongly associated with multiple risk factors and contributes significantly to the increased risk of chronic disease morbidity and mortality worldwide. Understanding the association between risk factors and the occurrence of obesity remains challenging: the traditional regression approach limits analysis to a small number of predictors and imposes assumptions of independence and linearity. Machine Learning (ML) methods offer an alternative approach to analysing obesity data. This study aims to assess the ability of three ML methods, namely Logistic Regression, Classification and Regression Trees (CART), and Naïve Bayes, to identify the presence of obesity using publicly available health data, going beyond traditional prediction models, and to compare the performance of the three methods. A further objective is to establish a set of risk factors for obesity in adults among the available study variables. We address data imbalance using the Synthetic Minority Oversampling Technique (SMOTE) and predict obesity status from the risk factors available in the dataset. The Logistic Regression method shows the highest performance, although kappa coefficients indicate only moderate concordance between predicted and measured obesity. Location, marital status, age group, education, sweet drinks, fatty/oily foods, grilled foods, preserved foods, seasoning powders, soft/carbonated drinks, alcoholic drinks, mental-emotional disorders, diagnosed hypertension, physical activity, smoking, and fruit and vegetable consumption are significant in predicting obesity status in adults. Identifying these risk factors could inform health authorities in designing or modifying policies for better controlling chronic diseases, especially in relation to risk factors associated with obesity. Moreover, applying ML methods to publicly available health data, such as the Indonesian Basic Health Research (RISKESDAS), is a promising strategy to fill the gap towards a more robust understanding of how multiple risk factors jointly predict health outcomes.
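
A minimal Python sketch of this kind of comparison, assuming scikit-learn and imbalanced-learn, synthetic data in place of the RISKESDAS variables, and GaussianNB / DecisionTreeClassifier as stand-ins for the paper's Naïve Bayes and CART implementations:

```python
# Hedged sketch: SMOTE on the training split, then compare three classifiers
# on accuracy and Cohen's kappa. All data and sizes are invented.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier          # stand-in for CART
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=8000, n_features=16, n_informative=10,
                           weights=[0.85, 0.15], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
X_tr, y_tr = SMOTE(random_state=1).fit_resample(X_tr, y_tr)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(random_state=1),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: accuracy={accuracy_score(y_te, pred):.3f}, "
          f"kappa={cohen_kappa_score(y_te, pred):.3f}")
```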


2021, Vol 5 (1), pp. 75-91
Author(s): Sri Astuti Thamrin, Dian Sidik, Hedi Kuswanto, Armin Lawi, Ansariadi Ansariadi

The accuracy of the data classes is very important in classification with a machine learning approach: the more accurate the data set and its class labels, the better the output generated by machine learning. In practice, classification often faces class imbalance, in which the classes do not hold equal shares of the data set, and this imbalance degrades classification accuracy. One of the simplest ways to correct imbalanced data classes is to balance them. This study explores the problem of class imbalance in a medium-sized case dataset and addresses that imbalance as well. The Synthetic Minority Over-Sampling Technique (SMOTE) is used to overcome class imbalance in obesity status in the Indonesian 2013 Basic Health Research (RISKESDAS) data. The results show that the obese class accounts for 13.9% of the data and the non-obese class for 84.6%, indicating a class imbalance of moderate severity. Moreover, SMOTE with 600% over-sampling enlarges the minority (obesity) class so that the obesity status classes become balanced. Applying the SMOTE technique therefore performed better than not applying it when exploring obesity status in the Indonesian RISKESDAS 2013 data.
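
For illustration, the snippet below maps a 600% over-sampling setting onto imbalanced-learn's SMOTE. The class sizes are synthetic, and the assumption that N = 600% means six synthetic examples per original minority example follows the original SMOTE parameterization, not necessarily the exact setup of this study.

```python
# Hedged sketch: 600% SMOTE over-sampling of the minority class on synthetic data.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10000, weights=[0.86, 0.14], random_state=2)
counts = Counter(y)
n_min = counts[1]  # class 1 is the minority here

# 600% over-sampling: the minority class grows to 7x its original size
# (1 real + 6 synthetic examples per original minority example).
X_res, y_res = SMOTE(sampling_strategy={1: 7 * n_min},
                     random_state=2).fit_resample(X, y)
print("before:", counts, "after:", Counter(y_res))
```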


Author(s): Sunhae Kim, Hye-Kyung Lee, Kounseok Lee

(1) Background: The Patient Health Questionnaire-9 (PHQ-9) is a tool that screens patients for depression in primary care settings. In this study, we evaluated the efficacy of the PHQ-9 in assessing suicidal ideation. (2) Methods: A total of 8760 completed questionnaires collected from college students were analyzed. PHQ-9 item scores were combined and evaluated in four configurations (PHQ-2, PHQ-8, PHQ-9, and PHQ-10). Suicidal ideation was evaluated using the Mini-International Neuropsychiatric Interview suicidality module. Analyses used suicidal ideation as the dependent variable and applied machine learning (ML) algorithms: k-nearest neighbors, linear discriminant analysis (LDA), and random forest. (3) Results: The random forest applied to the nine items of the PHQ-9 yielded an excellent area under the curve of 0.841, with 94.3% accuracy. The positive and negative predictive values were 84.95% (95% CI = 76.03–91.52) and 95.54% (95% CI = 94.42–96.48), respectively. (4) Conclusion: This study confirmed that ML algorithms using the PHQ-9 in the primary care field are reliably accurate in screening individuals with suicidal ideation.
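
A hedged sketch of the random-forest variant of this analysis, with synthetic PHQ-9 item scores and an invented toy label standing in for the MINI suicidality outcome (none of the values, thresholds, or hyper-parameters come from the study):

```python
# Hedged sketch: random forest on nine item scores, reporting AUC, PPV, and NPV.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 8760
X = rng.integers(0, 4, size=(n, 9))                 # nine items scored 0-3 (synthetic)
ideation = (X.sum(axis=1) + rng.normal(0, 3, n) > 14).astype(int)  # toy label only

X_tr, X_te, y_tr, y_te = train_test_split(X, ideation, stratify=ideation, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

pred = rf.predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
print("PPV:", tp / (tp + fp))
print("NPV:", tn / (tn + fn))
```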


2020
Author(s): Aaron Cardenas-Martinez, Victor Rodriguez-Galiano, Juan Antonio Luque-Espinar, Maria Paula Mendes

Identifying the sources and driving forces of groundwater nitrate pollution is of paramount importance, contributing to the implementation and evaluation of agro-environmental measures. High concentrations of nitrates in groundwater occur all around the world, in rich and less developed countries alike. In the case of Spain, 21.5% of the wells of the groundwater quality monitoring network showed mean concentrations above the quality standard (QS) of 50 mg/l. The objectives of this work were: i) to predict the current probability of nitrate concentrations above the QS in Andalusian groundwater bodies (Spain) using past-time features, some of them obtained from satellite observations; ii) to assess the importance of features in the prediction; and iii) to evaluate different machine learning (ML) approaches and feature selection (FS) techniques. Several predictive models based on an ML algorithm, the Random Forest, were used, as well as FS techniques. 321 nitrate samples and their respective predictive features were obtained from different groundwater bodies. The predictive features were divided into three groups according to their focus: agricultural production (phenology); livestock pressure (excretion rates); and environmental settings (soil characteristics and texture, geomorphology, and local climate conditions). Models were trained with the features of one year [YEAR(t0)] and then applied to new features obtained for the next year [YEAR(t0+1)], performing k-fold cross-validation. Additionally, a further prediction was carried out for the present time [YEAR(t0+n)], validated with an independent test set. This methodology examined the use of a model, trained on previous nitrate concentrations and predictive features, for predicting current nitrate concentrations from present features. Our findings showed an improvement in predictive performance when using a wrapper with sequential search for FS compared with using the Random Forest algorithm alone. Phenology features derived from remotely sensed variables were the most explanatory, performing better than static land-use maps or vegetation index images (e.g., NDVI). They also provided much more comprehensive information and, more importantly, rely only on features extrinsic to the groundwater bodies.
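
As a rough illustration of wrapper-based sequential feature selection around a Random Forest with k-fold cross-validation, the following scikit-learn sketch uses a synthetic 321-sample matrix in place of the real phenology, livestock, and environmental predictors; the selector settings and scoring choice are assumptions, not the authors' exact configuration.

```python
# Hedged sketch: forward sequential (wrapper) feature selection with a Random Forest,
# scored by 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# 321 samples, 20 candidate predictors standing in for the real features.
X, y = make_classification(n_samples=321, n_features=20, n_informative=6, random_state=3)

rf = RandomForestClassifier(n_estimators=100, random_state=3)
selector = SequentialFeatureSelector(rf, n_features_to_select=6,
                                     direction="forward", cv=5,
                                     scoring="roc_auc").fit(X, y)
X_sel = selector.transform(X)

print("selected feature indices:", selector.get_support(indices=True))
print("CV AUC with selected features:",
      cross_val_score(rf, X_sel, y, cv=5, scoring="roc_auc").mean())
```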


2019
Author(s): Liam Brierley, Amy B. Pedersen, Mark E. J. Woolhouse

Abstract: Novel infectious diseases continue to emerge within human populations. Predictive studies have begun to identify pathogen traits associated with emergence. However, emerging pathogens vary widely in virulence, a key determinant of their ultimate risk to public health. Here, we use structured literature searches to review the virulence of each of the 214 known human-infective RNA virus species. We then use a machine learning framework to determine whether viral virulence can be predicted by ecological traits including human-to-human transmissibility, transmission routes, tissue tropisms and host range. Using severity of clinical disease as a measurement of virulence, we identified potential risk factors using predictive classification tree and random forest ensemble models. The random forest model predicted literature-assigned disease severity of test data with 90.3% accuracy, compared to a null accuracy of 74.2%. In addition to viral taxonomy, the ability to cause systemic infection, having renal and/or neural tropism, direct contact or respiratory transmission, and limited (0 < R0 ≤ 1) human-to-human transmissibility were the strongest predictors of severe disease. We present a novel, comparative perspective on the virulence of all currently known human RNA virus species. The risk factors identified may provide novel perspectives in understanding the evolution of virulence and elucidating molecular virulence mechanisms. These risk factors could also improve planning and preparedness in public health strategies as part of a predictive framework for novel human infections. Author Summary: Newly emerging infectious diseases present potentially serious threats to global health. Although studies have begun to identify pathogen traits associated with the emergence of new human diseases, these do not address why emerging infections vary in the severity of disease they cause, often termed 'virulence'. We test whether ecological traits of human viruses can act as predictors of virulence, as suggested by theoretical studies. We conduct the first systematic review of virulence across all currently known human RNA virus species. We adopt a machine learning approach by constructing a random forest, a model that aims to optimally predict an outcome using a specific structure of predictors. Predictions matched literature-assigned ratings for 28 of 31 test set viruses. Our predictive model suggests that higher virulence is associated with infection of multiple organ systems, the nervous system, or the renal system. Higher virulence was also associated with contact-based or airborne transmission, and limited capability to transmit between humans. These risk factors may provide novel starting points for questioning why virulence should evolve and identifying causative mechanisms of virulence. In addition, our work could suggest priority targets for infectious disease surveillance and future public health risk strategies. Blurb: Comparative analysis using machine learning shows specificity of tissue tropism and transmission biology can act as predictive risk factors for virulence of human RNA viruses.
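
A minimal sketch of the core modelling step only (a random forest classifying severity from trait predictors, compared against the null accuracy of always predicting the majority class); the trait matrix, class balance, and 31-virus test split are synthetic stand-ins, not the 214-virus dataset.

```python
# Hedged sketch: random forest vs. majority-class (null) baseline on synthetic trait data.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=214, n_features=12, n_informative=6,
                           weights=[0.74, 0.26], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=31, stratify=y, random_state=4)

rf = RandomForestClassifier(n_estimators=1000, random_state=4).fit(X_tr, y_tr)
null = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

print("random forest accuracy:", accuracy_score(y_te, rf.predict(X_te)))
print("null accuracy:", accuracy_score(y_te, null.predict(X_te)))
```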


2020, pp. 1-24
Author(s): TINGBIN BIAN, JIN CHEN, QU FENG, JINGYI LI

We compare econometric analyses with machine learning approaches in the context of the Singapore private property market, using transaction data covering the period 1995–2018. A hedonic model is employed to quantify the premiums of important attributes and amenities, with a focus on the premium of distance to the nearest Mass Rapid Transit (MRT) station. In parallel, machine learning algorithms in three categories (LASSO, random forest, and artificial neural networks) are applied in the same context to give deeper insight into the importance of the determinants of property prices. The results suggest that the MRT-distance premium is significant: moving 100 m closer to the nearest MRT station from the mean distance point would increase the overall transacted price by about 15,000 Singapore dollars (SGD). The machine learning approaches generally achieve higher prediction accuracy, and LASSO suggests a heterogeneous property-age premium. Using the random forest algorithm, we find that property prices are mostly affected by key macroeconomic factors, such as the time of sale, as well as the size and floor level of the property. Finally, an appraisal of the different approaches is provided for researchers wishing to utilize additional data sources and data-driven approaches to exploit potential causal effects in economic studies.
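
To make the comparison concrete, here is a hedged Python sketch that fits a linear hedonic model, a LASSO, and a random forest on invented transaction-like data; the variable names (mrt_dist_m, floor_area, floor_level, property_age) and the toy price formula are placeholders, not the paper's specification.

```python
# Hedged sketch: hedonic (linear) model vs. LASSO vs. random forest on synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 5000
df = pd.DataFrame({
    "mrt_dist_m": rng.uniform(50, 2000, n),
    "floor_area": rng.uniform(40, 200, n),
    "floor_level": rng.integers(1, 40, n),
    "property_age": rng.integers(0, 50, n),
})
# Toy price: closer to the MRT and larger/higher units are worth more (invented coefficients).
price = (1_500_000 - 150 * df["mrt_dist_m"] + 8_000 * df["floor_area"]
         + 5_000 * df["floor_level"] - 3_000 * df["property_age"]
         + rng.normal(0, 50_000, n))

X_tr, X_te, y_tr, y_te = train_test_split(df, price, random_state=5)

hedonic = LinearRegression().fit(X_tr, y_tr)
lasso = LassoCV(cv=5, max_iter=10000).fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=300, random_state=5).fit(X_tr, y_tr)

# Premium of moving 100 m closer to the MRT, as implied by the hedonic coefficients.
mrt_coef = hedonic.coef_[list(df.columns).index("mrt_dist_m")]
print(f"hedonic MRT premium per 100 m: {-100 * mrt_coef:,.0f} SGD")
for name, m in [("hedonic", hedonic), ("LASSO", lasso), ("random forest", rf)]:
    print(name, "R^2 =", round(m.score(X_te, y_te), 3))
```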


Author(s): Krishna Kumar Mohbey

In any industry, attrition is a big problem, whether it is employee attrition in an organization or customer attrition on an e-commerce site. If we can accurately predict which customers or employees will leave their current company or organization, it will save the employer much time, effort, and cost, and help them hire or acquire substitutes in advance, so that the ongoing progress of the organization is not disrupted. In this chapter, a comparative analysis of various machine learning approaches, namely Naïve Bayes, SVM, decision tree, random forest, and logistic regression, is presented. The presented results help identify the behavior of employees who are likely to leave in the near future. Experimental results reveal that the logistic regression approach reaches up to 86% accuracy, outperforming the other machine learning approaches.
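
A minimal sketch of such a comparison in scikit-learn, on a synthetic stand-in for an HR attrition dataset (all features and labels are invented, and these are not the chapter's exact model settings):

```python
# Hedged sketch: comparing five classifiers on accuracy for a synthetic attrition task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=15, n_informative=8,
                           weights=[0.8, 0.2], random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=6),
    "Random forest": RandomForestClassifier(random_state=6),
    "Logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: {acc:.3f}")
```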


2012, Vol 51 (01), pp. 74-81
Author(s): J. D. Malley, J. Kruppa, A. Dasgupta, K. G. Malley, A. Ziegler

Summary. Background: Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. Objectives: The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities. Methods: Two random forest algorithms and two nearest neighbor algorithms are described in detail for the estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors, and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods and exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians. Results: Simulations demonstrate the validity of the method. With the real data application, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available, meaning that all calculations can be performed using existing software. Conclusions: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations exist in R and may be used for applications.
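
The paper supplies sample code from R packages; purely as a Python analogue of the underlying idea, the sketch below treats the 0/1 response as a regression target so that a regression forest and a nearest-neighbor regressor output individual probability estimates directly (the data and hyper-parameters are invented, not the paper's settings).

```python
# Hedged sketch: regression forest and k-NN regressor as "probability machines"
# on a synthetic binary outcome, evaluated with the Brier score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_classification(n_samples=4000, n_features=8, n_informative=5, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# Regressing the 0/1 outcome makes the predictions individual probability estimates.
rf_prob = RandomForestRegressor(n_estimators=500, random_state=7).fit(X_tr, y_tr)
knn_prob = KNeighborsRegressor(n_neighbors=50).fit(X_tr, y_tr)

for name, model in [("random forest", rf_prob), ("nearest neighbors", knn_prob)]:
    p = np.clip(model.predict(X_te), 0, 1)
    print(f"{name}: Brier score = {brier_score_loss(y_te, p):.4f}")
```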


2019, Vol 9 (1)
Author(s): Eun Kyung Park, Kwang-sig Lee, Bo Kyoung Seo, Kyu Ran Cho, Ok Hee Woo, ...

Abstract: Radiogenomics investigates the relationship between imaging phenotypes and genetic expression. Breast cancer is a heterogeneous disease that manifests complex genetic changes and varied prognoses and treatment responses. We investigate the value of machine learning approaches to radiogenomics using low-dose perfusion computed tomography (CT) to predict prognostic biomarkers and molecular subtypes of invasive breast cancer. This prospective study enrolled a total of 723 cases involving 241 patients with invasive breast cancer. The 18 CT parameters of the cancers were analyzed using 5 machine learning models to predict lymph node status, tumor grade, tumor size, hormone receptors, HER2, Ki67, and the molecular subtypes. The random forest model was the best model in terms of accuracy and the area under the receiver-operating characteristic curve (AUC); on average, it had 13% higher accuracy and 0.17 higher AUC than logistic regression. The most important CT parameters in the random forest model were peak enhancement intensity (Hounsfield units), time to peak (seconds), blood volume permeability (mL/100 g), and perfusion of tumor (mL/min per 100 mL). Machine learning approaches to radiogenomics using low-dose perfusion breast CT provide a useful noninvasive tool for predicting prognostic biomarkers and molecular subtypes of invasive breast cancer.
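
A hedged sketch of the random forest versus logistic regression comparison on a synthetic 723 x 18 feature matrix standing in for the perfusion-CT parameters (the feature values, model settings, and split are assumptions, not the study's protocol):

```python
# Hedged sketch: random forest vs. logistic regression on 18 synthetic CT-like features,
# reporting accuracy, AUC, and the forest's top-ranked features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=723, n_features=18, n_informative=8, random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=8)

rf = RandomForestClassifier(n_estimators=500, random_state=8).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, m in [("random forest", rf), ("logistic regression", lr)]:
    print(name,
          "accuracy:", round(accuracy_score(y_te, m.predict(X_te)), 3),
          "AUC:", round(roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]), 3))

# Ranking of the synthetic features by random-forest importance (indices only).
top = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])[:4]
print("top feature indices:", top)
```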

