Bayesian Analysis for Dynamic Generalized Linear Latent Model with Application to Tree Survival Rate

The logistic regression model is the most popular regression technique for modeling categorical data, especially dichotomous variables. The classic logistic regression model is typically used to interpret the relationship between response variables and explanatory variables. However, in real applications, most data sets are collected through follow-up, which leads to temporal correlation among the data. To characterize the correlations among different variables, a new method based on latent variables is introduced in this study. At the same time, latent variables following an AR(1) model are used to depict time dependence. In the framework of Bayesian analysis, parameter estimation and statistical inference are carried out via a Gibbs sampler with the Metropolis-Hastings (MH) algorithm. Model comparison, based on the Bayes factor, and forecasting/smoothing of the survival rate of the tree are established. A simulation study is conducted to assess the performance of the proposed method, and a pika data set is analyzed to illustrate the real application. Since Bayes factor approaches vary significantly, efficiency tests have been performed to decide which solution provides a better tool for the analysis of real relational data sets.
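
As a rough illustration of the kind of model this abstract describes, the sketch below simulates binary outcomes whose log-odds combine a covariate effect with an AR(1) latent state. It covers only the data-generating side, not the Gibbs/MH estimation, and the abstract does not give the exact specification: the function name, the single covariate, and the parameters `beta0`, `beta1`, `phi` and `sigma` are all assumptions made for illustration.

```python
import math
import random


def simulate_dynamic_logistic(n_steps, beta0, beta1, phi, sigma, seed=0):
    """Simulate binary outcomes with an AR(1) latent term in the log-odds.

    Hypothetical model (not taken from the paper):
        latent_t = phi * latent_{t-1} + Normal(0, sigma)
        logit(p_t) = beta0 + beta1 * x_t + latent_t
        y_t ~ Bernoulli(p_t)
    """
    rng = random.Random(seed)
    latent = 0.0  # assumed starting value of the latent state
    outcomes, probs = [], []
    for _ in range(n_steps):
        x = rng.gauss(0.0, 1.0)  # hypothetical standard-normal covariate
        latent = phi * latent + rng.gauss(0.0, sigma)  # AR(1) update
        logit = beta0 + beta1 * x + latent
        p = 1.0 / (1.0 + math.exp(-logit))  # inverse-logit link
        outcomes.append(1 if rng.random() < p else 0)
        probs.append(p)
    return outcomes, probs


ys, ps = simulate_dynamic_logistic(50, beta0=0.5, beta1=1.0, phi=0.8, sigma=0.3, seed=1)
print(sum(ys), "events out of", len(ys))
```

With `phi` close to 1 the latent state changes slowly, so successive outcomes are more strongly correlated; with `phi = 0` the sketch reduces to an ordinary logistic model with independent noise in the log-odds.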

GENE SELECTION USING LOGISTIC REGRESSIONS BASED ON AIC, BIC AND MDL CRITERIA

New Mathematics and Natural Computation ◽

10.1142/s179300570500007x ◽

2005 ◽

Vol 01 (01) ◽

pp. 129-145 ◽

Cited By ~ 15

Author(s):

XIAOBO ZHOU ◽

XIAODONG WANG ◽

EDWARD R. DOUGHERTY

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Logistic Regression Model ◽

Gene Selection ◽

Information Criterion ◽

Cancer Classification ◽

Data Sets ◽

Classification Methods ◽

Gene Expressions ◽

Experimental Conditions

In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables (gene expressions) and the small number of experimental conditions. Many gene-selection and classification methods have been proposed; however, most of these treat gene selection and classification separately, and not under the same model. We propose a Bayesian approach to gene selection using the logistic regression model. The Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL) principle are used in constructing the posterior distribution of the chosen genes. The same logistic regression model is then used for cancer classification. Fast implementation issues for these methods are discussed. The proposed methods are tested on several data sets, including those arising from hereditary breast cancer, small round blue-cell tumors, lymphoma, and acute leukemia. The experimental results indicate that the proposed methods achieve high classification accuracies on these data sets. Some robustness and sensitivity properties of the proposed methods are also discussed. Finally, mixing logistic-regression-based gene selection with other classification methods and mixing logistic-regression-based classification with other gene-selection methods are considered.
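
For readers unfamiliar with the criteria named in this abstract, the sketch below shows the standard textbook definitions of AIC and BIC for a fitted model. It is not the paper's procedure for constructing a posterior over gene subsets, and MDL is not covered; the log-likelihood values and the two candidate gene subsets are hypothetical numbers chosen only to show how the penalties differ.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits of two logistic models on the same 40 samples:
# a small gene subset (3 parameters) and a larger one (8 parameters).
n_obs = 40
candidates = {
    "small subset": {"log_likelihood": -14.0, "n_params": 3},
    "large subset": {"log_likelihood": -11.5, "n_params": 8},
}

for name, fit in candidates.items():
    a = aic(fit["log_likelihood"], fit["n_params"])
    b = bic(fit["log_likelihood"], fit["n_params"], n_obs)
    print(f"{name}: AIC={a:.2f}  BIC={b:.2f}")
```

Because ln(n) exceeds 2 once the sample size is above about 7, BIC penalizes each extra parameter more heavily than AIC in most practical settings, which tends to favor smaller gene subsets.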

Cancer classification and biomarker selection via a penalized logsum network-based logistic regression model

Technology and Health Care ◽

10.3233/thc-218026 ◽

2021 ◽

Vol 29 ◽

pp. 287-295

Author(s):

Zhiming Zhou ◽

Haihui Huang ◽

Yong Liang

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Logistic Regression Model ◽

Gene Selection ◽

Simulated Data ◽

Biological Data ◽

Cancer Classification ◽

High Dimensional ◽

Data Set ◽

Biomarker Selection

BACKGROUND: In genome research, it is particularly important to identify molecular biomarkers or signaling pathways related to phenotypes. The logistic regression model is a powerful discrimination method that offers a clear statistical interpretation and yields class-membership probabilities. However, on its own it cannot perform biomarker selection. OBJECTIVE: The aim of this paper is to give the model efficient gene selection capability. METHODS: In this paper, we propose a new penalized logsum network-based regularization logistic regression model for gene selection and cancer classification. RESULTS: Experimental results on simulated data sets show that our method is effective in the analysis of high-dimensional data. For a large data set, the proposed method achieved AUC values of 89.66% (training) and 90.02% (testing), which are, on average, 5.17% (training) and 4.49% (testing) better than mainstream methods. CONCLUSIONS: The proposed method can be considered a promising tool for gene selection and cancer classification of high-dimensional biological data.

Meteorological and hydrological conditions triggering rockfall events in Germany

10.5194/egusphere-egu21-5367 ◽

2021 ◽

Author(s):

Katrin Nissen ◽

Stefan Rupp ◽

Björn Guse ◽

Uwe Ulbrich ◽

Sergiy Vorogushyn ◽

...

Keyword(s):

Logistic Regression ◽

Soil Moisture ◽

Statistical Model ◽

Regression Model ◽

Logistic Regression Model ◽

Daily Precipitation ◽

Hydrologic Model ◽

Data Set ◽

Hydrological Conditions ◽

Landslide Database

In this study we present the results of a logistic regression model aimed at describing changes in probabilities for rockfall events in Germany in response to changes in meteorological and hydrological conditions.

The rockfall events for this study are taken from the landslide database for Germany (Damm and Klose, 2015). The meteorological variables we tested as predictors for the logistic regression model are daily precipitation from the REGNIE data set (Rauthe et al. 2013), hourly precipitation from the RADKLIM radar climatology (Winterrath et al., 2018) and temperature from the E-OBS data set (Cornes et al., 2018). As there is no observational soil moisture data set covering the entire country, we used soil moisture modelled with the state-of-the-art hydrological model mHM (Samaniego et al. 2010), which was calibrated using gauge measurements.

In order to select the best statistical model we tested a large number of physically plausible combinations of meteorological and hydrological predictors. Each model was checked using cross-validation. The decision on the final model was based on the value of the logarithmic skill score and on expert judgement.

The final statistical model includes the local percentile of daily precipitation, total relative soil moisture and freeze-thawing cycles in the previous weeks as predictors. It was found that daily precipitation is the most important parameter in the model. An increase of daily precipitation from its median to its 80th percentile approximately doubles the probability for a rockfall event. Higher soil moisture and the occurrence of freeze-thaw cycles also increase the probability for rockfall events.

Cornes, R. C. et al., 2018: An ensemble version of the E-OBS temperature and precipitation data sets. Journal of Geophysical Research: Atmospheres, 123, 9391–9409.

Damm, B., Klose, M., 2015: The landslide database for Germany: Closing the gap at national level. Geomorphology, 249, 82–93.

Rauthe, M. et al., 2013: A Central European precipitation climatology – Part I: Generation and validation of a high-resolution gridded daily data set (HYRAS), Vol. 22(3), p. 235–256.

Samaniego, L. et al., 2010: Multiscale parameter regionalization of a grid-based hydrologic model at the mesoscale. Water Resour. Res., 46, W05523.

Winterrath, T. et al., 2018: RADKLIM Version 2017.002: Reprocessed gauge-adjusted radar data, one-hour precipitation sums (RW), DOI: 10.5676/DWD/RADKLIM_RW_V2017.002.

Institutions, entrepreneurial adaptation, and the legal form of the organization

Journal of Entrepreneurship and Public Policy ◽

10.1108/jepp-10-2019-0087 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Indu Khurana ◽

Dmitriy Krichevskiy ◽

Gregory Dempster ◽

Sean Stimpson

Keyword(s):

Logistic Regression ◽

Survival Rate ◽

Regression Model ◽

Economic Freedom ◽

Logistic Regression Model ◽

Legal Form ◽

Content Type ◽

Legal Structure ◽

The USA ◽

Over Time

Purpose: This paper aims to examine how economic freedom impacts the initial choice of legal structure for startup firms. The authors do this by first exploring whether economic freedom is an essential determinant of the initial legal form of organization (LFO). The authors then explore the impact of economic freedom on firms' choice to change their initial legal structure over time and how this change impacts their survival rate. Design/methodology/approach: The authors employ a multinomial logistic regression model to measure the initial determinants of LFO, utilizing an eight-year panel data set of 4,928 startups in the USA from the Kauffman Firm Survey merged with the Economic Freedom of North America index from the Fraser Institute. The authors then employ a logistic regression model to examine the determinants facilitating a change in legal structure over time. Findings: The results show that economic freedom is a significant determinant in the choice of legal structure. The findings also show that the majority of startups do not change their legal form, but those that do change their legal structure show a higher survival rate. Research limitations/implications: Major limitations are the size of the data set and the somewhat limited differences in economic freedom within the USA. More nuanced measures of economic freedom would be highly desirable. Practical implications: Policymakers should take note that limited red tape, smoothly working labor markets and straightforward processes for changing the legal structure of organizations would improve survival and growth odds for entrepreneurs. Originality/value: Drawing on the theory of institutions, the authors attempt to bridge a gap in the literature by explicitly analyzing the determinants of the legal structure of startups in light of economic freedom. Institutional factors do not work in isolation; therefore, the authors also employ traditional entrepreneur-specific variables that affect the choice of legal structure in addition to the institutional framework.

An Artificial Neural Network–Based Pediatric Mortality Risk Score: Development and Performance Evaluation Using Data From a Large North American Registry (Preprint)

10.2196/preprints.24079 ◽

2020 ◽

Author(s):

Niema Ghanad Poor ◽

Nicholas C West ◽

Rama Syamala Sreepada ◽

Srinivas Murthy ◽

Matthias Görges

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Mortality Risk ◽

Regression Models ◽

Logistic Regression Model ◽

Single Layer ◽

Ann Model ◽

Data Set ◽

Using Data ◽

Better Than

BACKGROUND: In the pediatric intensive care unit (PICU), quantifying illness severity can be guided by risk models to enable timely identification and appropriate intervention. Logistic regression models, including the pediatric index of mortality 2 (PIM-2) and pediatric risk of mortality III (PRISM-III), produce a mortality risk score using data that are routinely available at PICU admission. Artificial neural networks (ANNs) outperform regression models in some medical fields. OBJECTIVE: In light of this potential, we aim to examine ANN performance, compared to that of logistic regression, for mortality risk estimation in the PICU. METHODS: The analyzed data set included patients from North American PICUs whose discharge diagnostic codes indicated evidence of infection and included the data used for the PIM-2 and PRISM-III calculations and their corresponding scores. We stratified the data set into training and test sets, with approximately equal mortality rates, in an effort to replicate real-world data. Data preprocessing included imputing missing data through simple substitution and normalizing data into binary variables using PRISM-III thresholds. A 2-layer ANN model was built to predict pediatric mortality, along with a simple logistic regression model for comparison. Both models used the same features required by PIM-2 and PRISM-III. Alternative ANN models using single-layer or unnormalized data were also evaluated. Model performance was compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC) and their empirical 95% CIs. RESULTS: Data from 102,945 patients (including 4068 deaths) were included in the analysis. The highest performing ANN (AUROC 0.871, 95% CI 0.862-0.880; AUPRC 0.372, 95% CI 0.345-0.396) that used normalized data performed better than PIM-2 (AUROC 0.805, 95% CI 0.801-0.816; AUPRC 0.234, 95% CI 0.213-0.255) and PRISM-III (AUROC 0.844, 95% CI 0.841-0.855; AUPRC 0.348, 95% CI 0.322-0.367). The performance of this ANN was also significantly better than that of the logistic regression model (AUROC 0.862, 95% CI 0.852-0.872; AUPRC 0.329, 95% CI 0.304-0.351). The performance of the ANN that used unnormalized data (AUROC 0.865, 95% CI 0.856-0.874) was slightly inferior to our highest performing ANN; the single-layer ANN architecture performed poorly and was not investigated further. CONCLUSIONS: A simple ANN model performed slightly better than the benchmark PIM-2 and PRISM-III scores and a traditional logistic regression model trained on the same data set. The small performance gains achieved by this two-layer ANN model may not offer clinically significant improvement; however, further research with other or more sophisticated model designs and better imputation of missing data may be warranted.

A Probability of Growth Model for Escherichia coli O157:H7 as a Function of Temperature, pH, Acetic Acid, and Salt†

Journal of Food Protection ◽

10.4315/0362-028x-64.12.1922 ◽

2001 ◽

Vol 64 (12) ◽

pp. 1922-1928 ◽

Cited By ~ 40

Author(s):

ROBIN C. McKELLAR ◽

XUEWEN LU

Keyword(s):

Escherichia Coli ◽

Logistic Regression ◽

Acetic Acid ◽

Regression Model ◽

Logistic Regression Model ◽

Fractional Factorial ◽

Data Sets ◽

Escherichia Coli O157 ◽

Growth Interface ◽

Five Factors

Data accumulated on the growth of Escherichia coli O157:H7 in tryptic soy broth (TSB) were used to develop a logistic regression model describing the growth-no growth interface as a function of temperature, pH, salt, sucrose, and acetic acid. A fractional factorial design with five factors was used at the following levels: temperature (10 to 30°C), acetic acid (0 to 4%), salt (0.5 to 16.5%), sucrose (0 to 8%), and pH (3.5 to 6.0). A total of 1,820 treatment combinations were used to create the model, which correctly predicted 1,802 (99%) of the points, with 10 false positives and 8 false negatives. Concordance was 99.9%, discordance was 0.1%, and the maximum rescaled R² value was 0.927. Acetic acid was the factor having the most influence on the growth-no growth interface; addition of as little as 0.5% resulted in an increase in the observed minimum pH for growth from 4.0 to 5.5. Increasing the salt concentration also had a significant effect on the interface; at all acetic acid concentrations, increasing salt increased the minimum temperature at which growth was observed. Using two literature data sets (26 conditions), the logistic model failed to predict growth in only one case. The results of this study suggest that the logistic regression model can be used to make conservative predictions of the growth-no growth interface of E. coli O157:H7.

Early Detection of Severe Functional Impairment Among Adolescents With Major Depression Using Logistic Classifier

Frontiers in Public Health ◽

10.3389/fpubh.2020.622007 ◽

2021 ◽

Vol 8 ◽

Author(s):

I.-Ming Chiu ◽

Wenhua Lu ◽

Fangming Tian ◽

Daniel Hart

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Logistic Model ◽

Logistic Regression Model ◽

Age Groups ◽

Recall Rate ◽

Training Data ◽

Statistical Tool ◽

Data Set ◽

Severe Impairment

Machine learning is about finding patterns and making predictions from raw data. In this study, we aimed to achieve two goals by utilizing the modern logistic regression model as a statistical tool and classifier. First, we analyzed the associations between Major Depressive Episode with Severe Impairment (MDESI) in adolescents and a list of broadly defined sociodemographic characteristics. Using findings from the logistic model, the second and ultimate goal was to identify potential MDESI cases using a logistic model as a classifier (i.e., a predictive mechanism). Data on adolescents aged 12–17 years who participated in the National Survey on Drug Use and Health (NSDUH), 2011–2017, were pooled and analyzed. The logistic regression model revealed that, compared with males and adolescents aged 12–13, females and those in the age groups of 14–15 and 16–17 had a higher risk of MDESI. Blacks and Asians had a lower risk of MDESI than Whites. Living in a single-parent household, having less authoritative parents, and having negative school experiences further increased adolescents' risk of having MDESI. The predictive model successfully identified 66% of the MDESI cases (recall rate) and accurately identified 72% of the MDESI and MDESI-free cases (accuracy rate) in the training data set. The rates of both recall and accuracy remained about the same (66% and 72%) using the test data. Results from this study confirmed that the logistic model, when used as a classifier, can identify potential cases of MDESI in adolescents with acceptable recall and reasonable accuracy rates. The algorithmic identification of adolescents at risk for depression may improve prevention and intervention.

Analysis of Individual Loan Defaults Using Logit under Supervised Machine Learning Approach

Asian Journal of Probability and Statistics ◽

10.9734/ajpas/2019/v3i430100 ◽

2019 ◽

pp. 1-12

Author(s):

Dominic M. Obare ◽

Gladys G. Njoroge ◽

Moses M. Muraya

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Regression Model ◽

Test Data ◽

Logistic Regression Model ◽

Functional Form ◽

Supervised Machine Learning ◽

Data Set ◽

Machine Learning Approach ◽

Loan Defaults

Financial institutions hold a large amount of data on their borrowers, which can be used to predict the probability that a borrower will default on a loan. Models that have been used to predict individual loan defaults include linear discriminant analysis models and extreme value theory models. These models are parametric in nature, since they assume that the response being investigated takes a particular functional form. However, the functional form used to estimate the response may be very different from its actual functional form. The purpose of this research was to analyze individual loan defaults in Kenya using the logistic regression model. The data used in this study were obtained from Equity Bank of Kenya for the period from 2006 to 2016. A random sample of 1000 loan applicants whose loans had been approved by Equity Bank of Kenya during this period was obtained. The data covered credit history, purpose of the loan, loan amount, nature of the savings account, employment status, sex of the applicant, age of the applicant, security used when acquiring the loan and the area of residence of the applicant (rural or urban). This study employed a quantitative research design; it deals with individual loan defaults as group characteristics of a borrower. The data were pre-processed by seeding using R software and then split into a training data set and a test data set. The training data were used to train the logistic regression model under a supervised machine learning approach. The R statistical software was used for the analysis of the data. The test data set was used to cross-validate the developed logistic model, which was later used to predict individual loan defaults. This study focused on the analysis of individual loan defaults in Kenya using the logistic regression model in machine learning. The logistic regression model predicted 303 defaults and 122 non-defaults from the training data set, and the misclassified loans numbered 56 and 69. The model had an accuracy of 0.7727 with the training data and 0.7333 with the test data. The logistic regression model showed a precision of 0.8440 and 0.8244 with the training and test data, respectively. The performance of the model with both the training and test data was illustrated using a plot of training errors and test errors against sample size on the same axes. The plot showed that the performance of the model improves as sample size increases. The study recommended the use of logistic regression in conjunction with a supervised machine learning approach for loan default prediction in financial institutions, and also that more research be carried out on ensemble methods of loan default prediction in order to increase prediction accuracy.
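
The training-set figures reported in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that defaults are the positive class, that 303 and 122 are correctly classified defaults and non-defaults, and that the 56 and 69 misclassified loans are false positives and false negatives respectively. The abstract does not state this assignment; it is inferred here because it reproduces the reported accuracy and precision.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of predicted positives that are true positives."""
    return tp / (tp + fp)


# Assumed reading of the training-set counts (defaults = positive class).
tp, tn, fp, fn = 303, 122, 56, 69

train_accuracy = accuracy(tp, tn, fp, fn)
train_precision = precision(tp, fp)

print(f"{train_accuracy:.4f} {train_precision:.4f}")  # prints "0.7727 0.8440"
```

Under the opposite assignment (69 false positives, 56 false negatives) the accuracy is unchanged but the precision drops to about 0.8145, which does not match the reported 0.8440.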

Are the odds odd? Positive return evidence in German horse race betting

The Journal of Gambling Business and Economics ◽

10.5750/jgbe.v7i2.585 ◽

2013 ◽

Vol 7 (2) ◽

pp. 19-32

Author(s):

Philipp Heinrich Hoff

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Logistic Regression Model ◽

Predictive Power ◽

Binary Logistic Regression ◽

Binary Logistic Regression Model ◽

Data Set ◽

Horse Race ◽

Unique Data ◽

Positive Return

The article addresses the question of whether odds derived from the behavior of bettors in a pari-mutuel setting really reflect the chances of winning for a particular horse in a particular race. Using a unique data set with more than 46,000 race observations from Germany for the years 2001 to 2003, the paper presents evidence on the favorite long-shot bias, evaluates the predictive power of totalizator odds, uses a binary logistic regression model to predict probabilities of winning and finally suggests three wagering strategies that actually yield positive returns in a hold-out oriented analytical setting.

Discrete choice and survival models in employee turnover analysis

Employee Relations ◽

10.1108/er-03-2017-0058 ◽

2018 ◽

Vol 40 (2) ◽

pp. 381-395 ◽

Cited By ~ 2

Author(s):

Rafa Madariaga ◽

Ramon Oller ◽

Joan Carles Martori

Keyword(s):

Logistic Regression ◽

Survival Analysis ◽

Regression Model ◽

Discrete Choice ◽

Survival Data ◽

Employee Turnover ◽

Logistic Regression Model ◽

Data Set ◽

Content Type ◽

Set Up

Purpose: The purpose of this paper is to assess the capacity of two methodological approaches – discrete choice and survival analysis models – to investigate the relationship between socio-economic characteristics and turnover in a retailing company. A comparison of the estimation results under each model and their interpretation is carried out. The study provides a guide to determine, assess and interpret the effects of different driving factors behind turnover. Design/methodology/approach: The authors use a data set containing information about 1,199 workers followed up between January 2007 and December 2009. First, not distinguishing voluntary and involuntary resignation, a binary logistic regression model and a Cox proportional hazards (PH) model for univariate survival data are set up and estimated. Second, distinguishing voluntary and involuntary resignation, a multinomial logistic regression model and a Cox PH model for competing risk data are set up and estimated. Findings: When no distinction is made, the results indicate that wage and age exert a negative effect on turnover. Risk of resignation is higher for male workers, single (not married) workers and Spanish nationals. When the distinction is made, the previous results hold for voluntary turnover: wage, age, gender, marital status and nationality are significant. However, when explaining involuntary turnover, all variables except wage lose explanatory power. The survival analysis approach is better suited, as it measures risk of resignation in a longitudinal way. Discrete choice models only study the risk at a particular cut-off point (24 months in the case of this study). Originality/value: This paper is a systematic application, evaluation and comparison of four different statistical models for analysing employee turnover in a single firm. This work is original because no systematic comparison has been done in the context of turnover.
