Comparisons of ADABOOST, KNN, SVM and Logistic Regression in Classification of Imbalanced Dataset

Author(s):  
Hezlin Aryani Abd Rahman ◽  
Yap Bee Wah ◽  
Haibo He ◽  
Awang Bulgiba
2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 3044-3044
Author(s):  
David Haan ◽  
Anna Bergamaschi ◽  
Yuhong Ning ◽  
William Gibb ◽  
Michael Kesling ◽  
...  

3044 Background: Epigenomic assays have recently become popular tools for identifying molecular biomarkers, both in tissue and in plasma. In particular, the 5-hydroxymethyl-cytosine (5hmC) method has been shown to enable the epigenomic regulation of gene expression and subsequent gene activity, with different patterns across several tumor and normal tissue types. In this study we show that 5hmC profiles enable discrete classification of tumor and normal tissue for breast, colorectal, lung, ovary and pancreas. Such classification was also recapitulated in cfDNA from patients with breast, colorectal, lung, ovarian and pancreatic cancers. Methods: DNA was isolated from 176 fresh frozen tissues from breast, colorectal, lung, ovary and pancreas (44 per tumor per tissue type and up to 11 tumor tissues for each stage (I-IV)) and up to 10 normal tissues per tissue type. cfDNA was isolated from plasma from 783 non-cancer individuals and 569 cancer patients. Plasma-isolated cfDNA and tumor genomic DNA were enriched for the 5hmC fraction using chemical labelling, sequenced, and aligned to a reference genome to construct feature sets of 5hmC patterns. Results: 5hmC multinomial logistic regression analysis was employed across tumor and normal tissues and identified a set of specific and discrete tumor and normal tissue gene-based features. This indicates that we can classify samples regardless of source, with a high degree of accuracy, based on tissue of origin and also distinguish between normal and tumor status. Next, we applied a stacked ensemble machine learning algorithm, combining multiple logistic regression models across diverse feature sets, to the cfDNA dataset composed of 783 non-cancers and 569 cancers comprising 67 breast, 118 colorectal, 210 lung, 71 ovarian and 100 pancreatic cancers. 
We identified a genomic signature that enables the classification of non-cancer versus cancer with an outer-fold cross-validation sensitivity of 49% (CI 45%-53%) at 99% specificity. Further, the outer-fold cross-validation sensitivity at 99% specificity for individual cancers was as follows: breast 30% (CI 19%-42%); colorectal 41% (CI 32%-50%); lung 49% (CI 42%-56%); ovarian 72% (CI 60%-82%); pancreatic 56% (CI 46%-66%). Conclusions: This study demonstrates that 5hmC profiles can distinguish cancer and normal tissues based on their origin. Further, 5hmC changes in cfDNA enable detection of several cancer types: breast, colorectal, lung, ovarian and pancreatic cancers. Our technology provides a non-invasive tool for cancer detection with low-risk sample collection, enabling better compliance than current screening methods. Among other uses, we believe our technology could be applied to asymptomatic high-risk individuals, thus enriching for those subjects who most need diagnostic imaging follow-up.
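
The phrase "sensitivity at 99% specificity" can be made concrete with a minimal sketch. It assumes each sample has a classifier score, with higher meaning more cancer-like, and it sets the decision threshold on the non-cancer scores alone. The function name and the toy scores are hypothetical and are not taken from the study.

```python
import math


def sensitivity_at_specificity(neg_scores, pos_scores, target_specificity=0.99):
    """Return (threshold, sensitivity, specificity) for a score threshold
    chosen so that specificity on the negatives is at least the target."""
    if not 0.0 < target_specificity <= 1.0:
        raise ValueError("target_specificity must be in (0, 1]")
    neg_sorted = sorted(neg_scores)
    n_neg = len(neg_sorted)
    # Largest number of negatives that may be called positive.
    # The small epsilon guards against floating-point round-off.
    max_fp = math.floor((1.0 - target_specificity) * n_neg + 1e-9)
    # A sample is called positive only if its score is strictly above
    # the threshold, so at most max_fp negatives exceed it.
    threshold = neg_sorted[n_neg - max_fp - 1]
    tp = sum(1 for s in pos_scores if s > threshold)
    fp = sum(1 for s in neg_scores if s > threshold)
    sensitivity = tp / len(pos_scores)
    specificity = 1.0 - fp / n_neg
    return threshold, sensitivity, specificity


# Hypothetical scores: 100 non-cancer samples and 4 cancer samples.
non_cancer = [i / 100 for i in range(100)]
cancer = [0.5, 0.985, 0.99, 1.2]
print(sensitivity_at_specificity(non_cancer, cancer, 0.99))
```

In a cross-validated setting such as the one described, the threshold would typically be fixed on training folds and the sensitivity read off held-out folds. The sketch skips that step for brevity.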


Author(s):  
Lina Li ◽  
Xinpei Wang ◽  
Xiaping Du ◽  
Yuanyuan Liu ◽  
Changchun Liu ◽  
...  

2018 ◽  
Vol 8 (9) ◽  
pp. 1569 ◽  
Author(s):  
Shengbing Wu ◽  
Hongkun Jiang ◽  
Haiwei Shen ◽  
Ziyi Yang

In recent years, gene selection for cancer classification based on the expression of a small number of gene biomarkers has been the subject of much research in genetics and molecular biology. The successful identification of gene biomarkers will help in the classification of different types of cancer and improve the prediction accuracy. Recently, regularized logistic regression using $L_1$ regularization has been successfully applied in high-dimensional cancer classification to estimate gene coefficients and perform gene selection simultaneously. However, $L_1$ regularization yields biased gene selection and does not have the oracle property. To address these problems, we investigate $L_{1/2}$ regularized logistic regression for gene selection in cancer classification. Experimental results on three DNA microarray datasets demonstrate that our proposed method outperforms other commonly used sparse methods ($L_1$ and $L_{EN}$) in terms of classification performance.
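
For orientation, the three penalties compared in this abstract can be written side by side. This is a sketch in standard textbook notation, not the paper's own: the $1/n$ scaling, the mixing weight $\alpha$, and the reading of $L_{EN}$ as the elastic net are assumptions.

```latex
\begin{aligned}
\ell(\beta) &= \sum_{i=1}^{n}\Big[\, y_i\, x_i^{\top}\beta - \log\!\big(1 + e^{x_i^{\top}\beta}\big) \Big],
  \qquad y_i \in \{0, 1\} \\[4pt]
\hat{\beta}_{L_1} &= \arg\min_{\beta}\; -\tfrac{1}{n}\,\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\[4pt]
\hat{\beta}_{L_{1/2}} &= \arg\min_{\beta}\; -\tfrac{1}{n}\,\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|^{1/2} \\[4pt]
\hat{\beta}_{L_{EN}} &= \arg\min_{\beta}\; -\tfrac{1}{n}\,\ell(\beta)
  + \lambda \Big( \alpha \sum_{j=1}^{p} |\beta_j| + (1-\alpha) \sum_{j=1}^{p} \beta_j^{2} \Big)
\end{aligned}
```

The $L_{1/2}$ penalty is non-convex. It grows more slowly than $|\beta_j|$ for large coefficients, which is the usual intuition for why it shrinks large effects less than $L_1$ does.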


AITI ◽  
2020 ◽  
Vol 17 (1) ◽  
pp. 42-55
Author(s):  
Radius Tanone ◽  
Arnold B Emmanuel

Bank XYZ is a bank in Kupang City, East Nusa Tenggara Province, that operates several ATMs placed at merchant locations. Both customers and non-customers use these ATMs to conduct transactions. Depending on where an ATM is placed, it is sometimes not used optimally for transactions, which wastes machine resources and leads to a condition called Not Operational Transaction (NOP). Given data consisting of several numeric independent variables, the task is to classify the dependent variable, NOP. A machine learning approach using logistic regression is applied to this classification. The research steps were collecting data, analyzing it with machine learning in Python, and writing reports. The approach yielded a prediction value of 0.507 for the classification. This means that in the future Bank XYZ can classify NOP conditions based on the behavior of customers or non-customers transacting at Bank XYZ ATMs.
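
The kind of analysis described here, logistic regression on numeric predictors in Python, can be sketched with the standard library alone. Everything in the sketch is an assumption made for illustration: the two scaled features, the toy labels, the learning rate, and the function names. None of it reproduces the study's data or code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Estimated probability that the label is 1 for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical, pre-scaled features per ATM: [transaction volume, hours in service].
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [1, 1, 1, 0, 0, 0]  # 1 = NOP condition (assumed labels)

w, b = fit_logistic(X, y)
print([round(predict_proba(w, b, x), 3) for x in X])
```

A library implementation such as scikit-learn's `LogisticRegression` would normally replace the hand-written loop. The loop is kept here only so the sketch runs without third-party packages.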


2018 ◽  
pp. 67-72 ◽  
Author(s):  
D. V. Borisenko ◽  
I. V. Prisukhina ◽  
S. A. Lunev ◽  
...  

2018 ◽  
Vol 2 (334) ◽  
Author(s):  
Mirosław Krzyśko ◽  
Łukasz Smaga

In this paper, the binary classification problem of multi‑dimensional functional data is considered. To solve this problem, a regression technique based on the functional logistic regression model is used. This model is re‑expressed as a particular logistic regression model by using the basis expansions of functional coefficients and explanatory variables. Based on the re‑expressed model, a classification rule is proposed. To handle outlying observations, robust methods of estimating the unknown parameters are also considered. Numerical experiments suggest that the proposed methods may behave satisfactorily in practice.
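
The re‑expression step can be sketched for a single functional predictor. The paper treats the multi‑dimensional case; the notation below (the basis $\varphi_k$, the coefficients $c_{ik}$ and $b_k$, the matrix $J$, the interval $T$) is assumed for illustration and is not taken from the paper.

```latex
\begin{aligned}
x_i(t) &= \sum_{k=1}^{K} c_{ik}\,\varphi_k(t) = c_i^{\top}\varphi(t),
\qquad
\beta(t) = \sum_{k=1}^{K} b_k\,\varphi_k(t) = b^{\top}\varphi(t), \\[4pt]
\operatorname{logit} P(Y_i = 1 \mid x_i)
  &= \beta_0 + \int_{T} x_i(t)\,\beta(t)\,dt
   = \beta_0 + c_i^{\top} J\, b,
\qquad
J = \int_{T} \varphi(t)\,\varphi(t)^{\top}\,dt .
\end{aligned}
```

With an orthonormal basis, $J$ is the identity matrix and the model reduces to an ordinary logistic regression on the basis coefficients $c_i$, with parameter vector $b$.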


Author(s):  
Lauren Gilstrap ◽  
Rishi K. Wadhera ◽  
Andrea M. Austin ◽  
Stephen Kearing ◽  
Karen E. Joynt Maddox ◽  
...  

BACKGROUND In January 2011, Centers for Medicare and Medicaid Services expanded the number of inpatient diagnosis codes from 9 to 25, which may influence comorbidity counts and risk‐adjusted outcome rates for studies spanning January 2011. This study examines the association between (1) limiting versus not limiting diagnosis codes after 2011, (2) using inpatient‐only versus inpatient and outpatient data, and (3) using logistic regression versus the Centers for Medicare and Medicaid Services risk‐standardized methodology and changes in risk‐adjusted outcomes. METHODS AND RESULTS Using 100% Medicare inpatient and outpatient files between January 2009 and December 2013, we created 2 cohorts of fee‐for‐service beneficiaries aged ≥65 years. The acute myocardial infarction cohort and the heart failure cohort had 578 728 and 1 595 069 hospitalizations, respectively. We calculated comorbidities using (1) inpatient‐only limited diagnoses, (2) inpatient‐only unlimited diagnoses, (3) inpatient and outpatient limited diagnoses, and (4) inpatient and outpatient unlimited diagnoses. Across both cohorts, International Classification of Diseases, Ninth Revision ( ICD‐9 ) diagnoses and hierarchical condition categories increased after 2011. When outpatient data were included, there were no significant differences in risk‐adjusted readmission rates using logistic regression or the Centers for Medicare and Medicaid Services risk standardization. A difference‐in‐differences analysis of risk‐adjusted readmission trends before versus after 2011 found no significant differences between limited and unlimited models for either cohort. CONCLUSIONS For studies that span 2011, researchers should consider limiting the number of inpatient diagnosis codes to 9 and/or including outpatient data to minimize the impact of the code expansion on comorbidity counts. 
However, the 2011 code expansion does not appear to significantly affect risk‐adjusted readmission rate estimates using either logistic or risk‐standardization models or when using or excluding outpatient data.
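
The difference‐in‐differences idea used in this abstract reduces to simple arithmetic on four rates. The sketch below uses made-up readmission rates for illustration only. The function and variable names are hypothetical, and the point estimate carries no significance test, which the study would have needed.

```python
def difference_in_differences(before_a, after_a, before_b, after_b):
    """Change over time in group A minus change over time in group B."""
    return (after_a - before_a) - (after_b - before_b)


# Hypothetical risk-adjusted readmission rates before and after 2011.
unlimited_before, unlimited_after = 0.20, 0.19  # model using all diagnosis codes
limited_before, limited_after = 0.20, 0.18      # model limited to 9 diagnosis codes

did = difference_in_differences(
    unlimited_before, unlimited_after, limited_before, limited_after
)
print(round(did, 4))
```

A value near zero would be consistent with the reported finding that the limited and unlimited models show similar trends across 2011.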


2009 ◽  
Vol 28 (30) ◽  
pp. 3798-3810 ◽  
Author(s):  
Jian Huang ◽  
Agus Salim ◽  
Kaibin Lei ◽  
Kathleen O'Sullivan ◽  
Yudi Pawitan
