Cancer classification and biomarker selection via a penalized logsum network-based logistic regression model

2021 ◽  
Vol 29 ◽  
pp. 287-295
Author(s):  
Zhiming Zhou ◽  
Haihui Huang ◽  
Yong Liang

BACKGROUND: In genome research, it is particularly important to identify molecular biomarkers or signaling pathways related to phenotypes. The logistic regression model is a powerful discrimination method that offers a clear statistical interpretation and yields a classification probability for each class label. However, on its own it cannot perform biomarker selection. OBJECTIVE: The aim of this paper is to give the model efficient gene selection capability. METHODS: In this paper, we propose a new penalized logsum network-based regularization logistic regression model for gene selection and cancer classification. RESULTS: Experimental results on simulated data sets show that our method is effective in the analysis of high-dimensional data. For a large data set, the proposed method achieved AUC values of 89.66% (training) and 90.02% (testing), which are, on average, 5.17% (training) and 4.49% (testing) better than mainstream methods. CONCLUSIONS: The proposed method can be considered a promising tool for gene selection and cancer classification of high-dimensional biological data.

2005 ◽  
Vol 01 (01) ◽  
pp. 129-145 ◽  
Author(s):  
XIAOBO ZHOU ◽  
XIAODONG WANG ◽  
EDWARD R. DOUGHERTY

In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables (gene expressions) and the small number of experimental conditions. Many gene-selection and classification methods have been proposed; however, most of these treat gene selection and classification separately, and not under the same model. We propose a Bayesian approach to gene selection using the logistic regression model. The Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL) principle are used in constructing the posterior distribution of the chosen genes. The same logistic regression model is then used for cancer classification. Fast implementation issues for these methods are discussed. The proposed methods are tested on several data sets including those arising from hereditary breast cancer, small round blue-cell tumors, lymphoma, and acute leukemia. The experimental results indicate that the proposed methods show high classification accuracies on these data sets. Some robustness and sensitivity properties of the proposed methods are also discussed. Finally, mixing logistic-regression-based gene selection with other classification methods and mixing logistic-regression-based classification with other gene-selection methods are considered.
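As a rough illustration of the information criteria named above, the following sketch computes AIC and BIC from the log-likelihood of a logistic model, using their standard definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L). The function names, toy labels, and probabilities are invented for illustration and do not come from the paper, which uses these criteria to construct a posterior over gene subsets rather than as standalone scores.

```python
import math

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary labels y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def aic(log_lik, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik

# Toy data (hypothetical): labels and fitted probabilities from some model
# with an intercept plus two selected genes, so k = 3.
y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.7, 0.6, 0.3, 0.1]
k = 3

ll = bernoulli_log_likelihood(y, p)
print("log-likelihood:", ll)
print("AIC:", aic(ll, k))
print("BIC:", bic(ll, k, len(y)))
```

Lower values of either criterion favor a gene subset; BIC penalizes each extra gene more heavily than AIC once n exceeds about 7 (ln n > 2).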


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Xiao-Ying Liu ◽  
Sheng-Bing Wu ◽  
Wen-Quan Zeng ◽  
Zhan-Jiang Yuan ◽  
Hong-Bo Xu

Biomarker selection and cancer classification play an important role in knowledge discovery using genomic data. Successful identification of gene biomarkers and biological pathways can significantly improve the accuracy of diagnosis and help machine learning models perform better when classifying different types of cancer. In this paper, we propose a LogSum + L2 penalized logistic regression model and use a coordinate descent algorithm to solve it. The results of simulations and real experiments indicate that the proposed method is highly competitive among several state-of-the-art methods. Our proposed model achieves excellent performance in group feature selection and classification problems.
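The abstract does not state the penalty formula, so the sketch below assumes one common form: a log-sum term, lam1 * sum(log(1 + |b_j| / eps)), plus a ridge term, lam2 * sum(b_j^2), added to the mean negative log-likelihood. The parameter names (`lam1`, `lam2`, `eps`) and the toy data are hypothetical, and the code only evaluates the objective; it does not implement the coordinate descent solver mentioned above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def neg_log_likelihood(X, y, beta, intercept=0.0):
    """Mean negative log-likelihood of a logistic regression model."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = intercept + sum(b * x for b, x in zip(beta, xi))
        p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def logsum_l2_penalty(beta, lam1, lam2, eps=1e-2):
    """Assumed penalty form: log-sum (sparsity) plus L2 (grouping) term."""
    logsum = sum(math.log(1.0 + abs(b) / eps) for b in beta)
    ridge = sum(b * b for b in beta)
    return lam1 * logsum + lam2 * ridge

def objective(X, y, beta, lam1, lam2, eps=1e-2):
    """Penalized objective that a solver would minimize over beta."""
    return neg_log_likelihood(X, y, beta) + logsum_l2_penalty(beta, lam1, lam2, eps)

# Toy data (hypothetical): four samples, two features.
X = [[1.0, 0.0], [0.5, 1.0], [-1.0, 0.2], [-0.5, -1.0]]
y = [1, 1, 0, 0]
print(objective(X, y, [0.0, 0.0], lam1=0.1, lam2=0.1))
print(objective(X, y, [1.0, 0.5], lam1=0.1, lam2=0.1))
```

In this assumed form the log-sum term is non-convex and penalizes small coefficients relatively more than large ones, which encourages sparsity, while the L2 term tends to keep correlated features together.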


2021 ◽  
Author(s):  
Katrin Nissen ◽  
Stefan Rupp ◽  
Björn Guse ◽  
Uwe Ulbrich ◽  
Sergiy Vorogushyn ◽  
...  

In this study we present the results of a logistic regression model aimed at describing changes in probabilities for rockfall events in Germany in response to changes in meteorological and hydrological conditions.

The rockfall events for this study are taken from the landslide database for Germany (Damm and Klose, 2015). The meteorological variables we tested as predictors for the logistic regression model are daily precipitation from the REGNIE data set (Rauthe et al. 2013), hourly precipitation from the RADKLIM radar climatology (Winterrath et al., 2018) and temperature from the E-OBS data set (Cornes et al., 2018). As there is no observational soil moisture data set covering the entire country, we used soil moisture modelled with the state-of-the-art hydrological model mHM (Samaniego et al. 2010), which was calibrated using gauge measurements.

In order to select the best statistical model we tested a large number of physically plausible combinations of meteorological and hydrological predictors. Each model was checked using cross-validation. The decision on the final model was based on the value of the logarithmic skill score and on expert judgement.

The final statistical model includes the local percentile of daily precipitation, total relative soil moisture and freeze-thawing cycles in the previous weeks as predictors. It was found that daily precipitation is the most important parameter in the model. An increase of daily precipitation from its median to its 80th percentile approximately doubles the probability for a rockfall event. Higher soil moisture and the occurrence of freeze-thaw cycles also increase the probability for rockfall events.

Cornes, R. C. et al., 2018: An ensemble version of the E-OBS temperature and precipitation data sets. Journal of Geophysical Research: Atmospheres, 123, 9391–9409.
Damm, B., Klose, M., 2015: The landslide database for Germany: Closing the gap at national level. Geomorphology, 249, 82–93.
Rauthe, M. et al., 2013: A Central European precipitation climatology – Part I: Generation and validation of a high-resolution gridded daily data set (HYRAS), Vol. 22(3), p. 235–256.
Samaniego, L. et al., 2010: Multiscale parameter regionalization of a grid-based hydrologic model at the mesoscale. Water Resour. Res., 46, W05523.
Winterrath, T. et al., 2018: RADKLIM Version 2017.002: Reprocessed gauge-adjusted radar data, one-hour precipitation sums (RW), DOI: 10.5676/DWD/RADKLIM_RW_V2017.002.


2020 ◽  
Author(s):  
Niema Ghanad Poor ◽  
Nicholas C West ◽  
Rama Syamala Sreepada ◽  
Srinivas Murthy ◽  
Matthias Görges

BACKGROUND In the pediatric intensive care unit (PICU), quantifying illness severity can be guided by risk models to enable timely identification and appropriate intervention. Logistic regression models, including the pediatric index of mortality 2 (PIM-2) and pediatric risk of mortality III (PRISM-III), produce a mortality risk score using data that are routinely available at PICU admission. Artificial neural networks (ANNs) outperform regression models in some medical fields. OBJECTIVE In light of this potential, we aim to examine ANN performance, compared to that of logistic regression, for mortality risk estimation in the PICU. METHODS The analyzed data set included patients from North American PICUs whose discharge diagnostic codes indicated evidence of infection and included the data used for the PIM-2 and PRISM-III calculations and their corresponding scores. We stratified the data set into training and test sets, with approximately equal mortality rates, in an effort to replicate real-world data. Data preprocessing included imputing missing data through simple substitution and normalizing data into binary variables using PRISM-III thresholds. A 2-layer ANN model was built to predict pediatric mortality, along with a simple logistic regression model for comparison. Both models used the same features required by PIM-2 and PRISM-III. Alternative ANN models using single-layer or unnormalized data were also evaluated. Model performance was compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC) and their empirical 95% CIs. RESULTS Data from 102,945 patients (including 4068 deaths) were included in the analysis. 
The highest performing ANN (AUROC 0.871, 95% CI 0.862-0.880; AUPRC 0.372, 95% CI 0.345-0.396) that used normalized data performed better than PIM-2 (AUROC 0.805, 95% CI 0.801-0.816; AUPRC 0.234, 95% CI 0.213-0.255) and PRISM-III (AUROC 0.844, 95% CI 0.841-0.855; AUPRC 0.348, 95% CI 0.322-0.367). The performance of this ANN was also significantly better than that of the logistic regression model (AUROC 0.862, 95% CI 0.852-0.872; AUPRC 0.329, 95% CI 0.304-0.351). The performance of the ANN that used unnormalized data (AUROC 0.865, 95% CI 0.856-0.874) was slightly inferior to our highest performing ANN; the single-layer ANN architecture performed poorly and was not investigated further. CONCLUSIONS A simple ANN model performed slightly better than the benchmark PIM-2 and PRISM-III scores and a traditional logistic regression model trained on the same data set. The small performance gains achieved by this two-layer ANN model may not offer clinically significant improvement; however, further research with other or more sophisticated model designs and better imputation of missing data may be warranted.
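For readers unfamiliar with the AUROC metric reported above, here is a minimal sketch that computes it through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the toy labels and scores are illustrative only; they are not the study's data, and the study's AUPRC and confidence-interval procedures are not reproduced here.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): 1 = died, 0 = survived; scores are predicted risks.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a demonstration; a data set the size of the one described would normally use a sort-based implementation.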


2021 ◽  
Vol 8 ◽  
Author(s):  
I.-Ming Chiu ◽  
Wenhua Lu ◽  
Fangming Tian ◽  
Daniel Hart

Machine learning is about finding patterns and making predictions from raw data. In this study, we aimed to achieve two goals by utilizing the modern logistic regression model as a statistical tool and classifier. First, we analyzed the associations between Major Depressive Episode with Severe Impairment (MDESI) in adolescents and a list of broadly defined sociodemographic characteristics. Using findings from the logistic model, the second and ultimate goal was to identify potential MDESI cases using a logistic model as a classifier (i.e., a predictive mechanism). Data on adolescents aged 12–17 years who participated in the National Survey on Drug Use and Health (NSDUH), 2011–2017, were pooled and analyzed. The logistic regression model revealed that, compared with males and adolescents aged 12–13, females and those in the age groups of 14–15 and 16–17 had a higher risk of MDESI. Blacks and Asians had a lower risk of MDESI than Whites. Living in a single-parent household, having less authoritative parents, and having negative school experiences further increased adolescents' risk of having MDESI. The predictive model successfully identified 66% of the MDESI cases (recall rate) and accurately identified 72% of the MDESI and MDESI-free cases (accuracy rate) in the training data set. The rates of both recall and accuracy remained about the same (66% and 72%) using the test data. Results from this study confirmed that the logistic model, when used as a classifier, can identify potential cases of MDESI in adolescents with acceptable recall and reasonable accuracy rates. The algorithmic identification of adolescents at risk for depression may improve prevention and intervention.


Author(s):  
Dominic M. Obare ◽  
Gladys G. Njoroge ◽  
Moses M. Muraya

Financial institutions hold large amounts of data on their borrowers, which can be used to predict the probability that a borrower will default on a loan. Models that have been used to predict individual loan defaults include linear discriminant analysis models and extreme value theory models. These models are parametric in nature since they assume that the response being investigated takes a particular functional form. However, the functional form used to estimate the response may be very different from the actual functional form of the response. The purpose of this research was to analyze individual loan defaults in Kenya using the logistic regression model. The data used in this study were obtained from Equity Bank of Kenya for the period from 2006 to 2016. A random sample of 1000 loan applicants whose loans had been approved by Equity Bank of Kenya during this period was obtained. The data covered credit history, purpose of the loan, loan amount, nature of the savings account, employment status, sex of the applicant, age of the applicant, security used when acquiring the loan, and the applicant's area of residence (rural or urban). This study employed a quantitative research design that treats individual loan defaults as group characteristics of a borrower. The data were pre-processed by seeding in R and then split into a training data set and a test data set. The training data were used to train the logistic regression model using a supervised machine learning approach. The R statistical software was used for the analysis of the data. The test data set was used for cross-validation of the developed logistic model, which was later used for prediction of individual loan defaults. This study focused on the analysis of individual loan defaults in Kenya using the logistic regression model in machine learning. On the training data set, the logistic regression model correctly predicted 303 defaults and 122 non-defaults, while 56 and 69 loans were misclassified. The model had an accuracy of 0.7727 with the training data and 0.7333 with the test data. The logistic regression model showed a precision of 0.8440 and 0.8244 with the training and test data, respectively. The performance of the model with both the training and test data was illustrated using a plot of training errors and test errors against sample size on the same axes. The plot showed that the performance of the model increases with an increase in sample size. The study recommended the use of logistic regression in conjunction with a supervised machine learning approach for loan default prediction in financial institutions, and also that more research be carried out on ensemble methods of loan default prediction in order to increase prediction accuracy.
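The training-set figures reported above can be checked with a short calculation. The sketch below assumes that 303 and 122 are the correctly predicted defaults and non-defaults, that 56 is the number of non-defaults wrongly predicted as defaults, and that 69 is the number of defaults that were missed; the abstract does not say which misclassification count is which, but this assignment reproduces the reported precision. The function names are illustrative.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all loans that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of predicted defaults that were actual defaults."""
    return tp / (tp + fp)

# Counts from the abstract (training data); the FP/FN assignment is an assumption.
tp, tn = 303, 122   # correctly predicted defaults and non-defaults
fp, fn = 56, 69     # misclassified loans

print(f"accuracy:  {accuracy(tp, fp, tn, fn):.4f}")   # 0.7727
print(f"precision: {precision(tp, fp):.4f}")          # 0.8440
```

Both values match the training accuracy (0.7727) and precision (0.8440) stated in the abstract, which supports reading the four counts as a confusion matrix over 550 training loans.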


2013 ◽  
Vol 7 (2) ◽  
pp. 19-32
Author(s):  
Philipp Heinrich Hoff

The article examines whether odds derived from the behavior of bettors in a pari-mutuel setting really reflect a particular horse's chances of winning a particular race. Using a unique data set with more than 46,000 race observations from Germany for the years 2001 to 2003, the paper presents evidence on the favorite-longshot bias, evaluates the predictive power of totalizator odds, uses a binary logistic regression model to predict probabilities of winning, and finally suggests three wagering strategies that actually yield positive returns in a hold-out oriented analytical setting.

