Supervised Learning for Binary Classification on US Adult Income

Journal of Modeling and Optimization ◽

10.32732/jmo.2021.13.2.80 ◽

2021 ◽

Vol 13 (2) ◽

pp. 80-91

Author(s):

Li-Pang Chen

Keyword(s):

Missing Values ◽

Binary Classification ◽

ROC Curves ◽

Income Level ◽

Support Vector ◽

Classification Methods ◽

Gender Education ◽

The US ◽

AUC Value ◽

R Functions

In this project, various binary classification methods are used to predict US adult income level from social factors including age, gender, education, and marital status. We first explore descriptive statistics for the dataset and deal with missing values. We then examine some widely used classification methods, including logistic regression, discriminant analysis, support vector machine, random forest, and boosting, and provide suitable R functions to demonstrate their application. Various metrics such as ROC curves, accuracy, recall, and F-measure are calculated to compare the performance of these models. We find that boosting is the best method in our data analysis because it attains the highest AUC value and the highest prediction accuracy. In addition, among all predictor variables, we identify three variables that have the largest impact on US adult income level.
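
The accuracy, recall, and F-measure named in the abstract can all be derived from the four confusion-matrix counts. The following is a minimal Python sketch for illustration only; the paper itself uses R, and the function name `binary_metrics` and the toy labels are assumptions, not taken from the article.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels.

    Hypothetical helper for illustration; not the paper's R code.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure (F1) is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up): 1 = high income, 0 = low income.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other tools may raise a warning or return NaN instead.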

Prediction of Incident Cancers in the Lifelines Population-Based Cohort

Cancers ◽

10.3390/cancers13092133 ◽

2021 ◽

Vol 13 (9) ◽

pp. 2133

Author(s):

Francisco O. Cortés-Ibañez ◽

Sunil Belur Nagaraj ◽

Ludo Cornelissen ◽

Gerjan J. Navis ◽

Bert van der Vegt ◽

...

Keyword(s):

Cancer Incidence ◽

Binary Classification ◽

Predictive Performance ◽

Population Based ◽

Support Vector ◽

Clinical Variables ◽

Incident Cancer ◽

History Of ◽

Diagnosis Of Cancer ◽

AUC Value

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models that could predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) < 0.75 was obtained; the highest AUC values were for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machines AUC = 0.83 (95% CI 0.78–0.88)); age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance for incident cancers in the Lifelines cohort.
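
For readers unfamiliar with the AUC values quoted above, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auc_score` and the toy data are assumptions for illustration, and this is not the study's own pipeline (it does not compute confidence intervals).

```python
def auc_score(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores.

    Illustrative O(n_pos * n_neg) implementation; real analyses
    typically use a library routine instead.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 1 = incident cancer, 0 = cancer-free.
labels = [0, 0, 1, 1]
risk = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, risk))  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values below 0.75 are described as moderate.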

Using Machine Learning Approaches to Explore Non-Cognitive Variables Influencing Reading Proficiency in English among Filipino Learners

Education Sciences ◽

10.3390/educsci11100628 ◽

2021 ◽

Vol 11 (10) ◽

pp. 628

Author(s):

Allan B. I. Bernardo ◽

Macario O. Cordel ◽

Rochelle Irene G. Lucas ◽

Jude Michael M. Teves ◽

Sashmir A. Yap ◽

...

Keyword(s):

Machine Learning ◽

School Environment ◽

Binary Classification ◽

Reading Proficiency ◽

Classification Model ◽

Support Vector ◽

Test Accuracy ◽

Learning Approaches ◽

Classification Methods ◽

Poor Reading

Filipino students ranked last in reading proficiency among all countries/territories in PISA 2018, with only 19% meeting the minimum (Level 2) standard. It is imperative to understand the range of factors that contribute to low reading proficiency, specifically variables that can be the target of interventions to help students with poor reading proficiency. We used machine learning approaches, specifically binary classification methods, to identify the variables that best predict low (Level 1b and lower) vs. higher (Level 1a or better) reading proficiency using the Philippine PISA data from a nationally representative sample of 15-year-old students. Several binary classification methods were applied, and the best classification model was derived using support vector machines (SVM), with 81.2% average test accuracy. The 20 variables with the highest impact in the model were identified and interpreted using a socioecological perspective of development and learning. These variables included students’ home-related resources and socioeconomic constraints, learning motivation and mindsets, classroom reading experiences with teachers, reading self-beliefs, attitudes, and experiences, and social experiences in the school environment. The results are discussed with reference to the need for a systems perspective to address poor proficiency, requiring interconnected interventions that go beyond students’ classroom reading.

Classifiers Accuracy Improvement Based on Missing Data Imputation

Journal of Artificial Intelligence and Soft Computing Research ◽

10.1515/jaiscr-2018-0002 ◽

2018 ◽

Vol 8 (1) ◽

pp. 31-48 ◽

Cited By ~ 11

Author(s):

Ivan Jordanov ◽

Nedyalko Petrov ◽

Alessio Petrozziello

Keyword(s):

Classification Accuracy ◽

Missing Values ◽

Statistical Significance ◽

ROC Curves ◽

Radar Signal ◽

Support Vector ◽

Data Set ◽

Missing Data Imputation ◽

Vector Machines ◽

Real World Datasets

In this paper we investigate further and extend our previous work on radar signal identification and classification based on a data set which comprises continuous, discrete and categorical data that represent radar pulse train characteristics such as signal frequencies, pulse repetition, type of modulation, intervals, scan period, scanning type, etc. Like most real-world datasets, it also contains a high percentage of missing values; to deal with this problem, we investigate three imputation techniques: Multiple Imputation (MI); K-Nearest Neighbour Imputation (KNNI); and Bagged Tree Imputation (BTI). We apply these methods to data samples with up to 60% missingness, thereby doubling the number of instances with complete values in the resulting dataset. The imputation models’ performance is assessed with Wilcoxon’s test for statistical significance and Cohen’s effect size metrics. To solve the classification task, we employ three intelligent approaches: Neural Networks (NN); Support Vector Machines (SVM); and Random Forests (RF). Subsequently, we critically analyse which imputation method most influences the classifiers’ performance, using a multiclass classification accuracy metric based on the area under the ROC curves. We consider two superclasses (‘military’ and ‘civil’), each containing several ‘subclasses’, and propose two new metrics, inner class accuracy (IA) and outer class accuracy (OA), in addition to the overall classification accuracy (OCA) metric. We conclude that they can be used as complements to the OCA when choosing the best classifier for the problem at hand.

Urban vegetation classification with high-resolution PlanetScope and SkySat multispectral imagery

Landscape & Environment ◽

10.21120/le/15/1/9 ◽

2021 ◽

Vol 15 (1) ◽

pp. 66-75

Author(s):

Loránd Szabó ◽

Dávid Abriha ◽

Kwanele Phinzi ◽

Szilárd Szabó

Keyword(s):

High Resolution ◽

Input Data ◽

ROC Curves ◽

Classification Performance ◽

Support Vector ◽

Urban Vegetation ◽

Classification Methods ◽

Linear Modeling ◽

Spectral Bands ◽

The Difference

In this study, two high-resolution satellite imagery sources, PlanetScope and SkySat, were compared based on their capabilities for classifying urban vegetation. We applied Random Forest and Support Vector Machine classification methods in a study area in the center of Rome, Italy. We first performed the classifications based on the spectral bands alone, then added the NDVI index. We evaluated the performance of the classifiers on the different sets of input data with ROC curves and AUC values. Additional statistical analyses were applied to reveal the correlation structure of the satellite bands and the NDVI, and General Linear Modeling was used to evaluate the AUC of the different models. Although the different classification methods did not produce significantly differing outcomes (AUC values between 0.96 and 0.99), SVM’s performance was better. Adding NDVI resulted in significantly higher AUC values. SkySat’s bands provided slightly better input data relative to PlanetScope, but the difference was minimal (~3%); accordingly, both satellites ensured excellent classification results.
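
The NDVI mentioned above is conventionally defined from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red). A minimal sketch of that standard formula follows; the function name `ndvi`, the zero-denominator convention, and the reflectance values are assumptions for illustration and are not taken from the study.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    nir, red: reflectance values for the near-infrared and red bands.
    Returns a value in [-1, 1] for non-negative reflectances.
    """
    denom = nir + red
    if denom == 0:
        # Assumed convention for this sketch: no signal means NDVI of 0.
        return 0.0
    return (nir - red) / denom


# Made-up reflectances: healthy vegetation reflects strongly in NIR.
vegetation_pixel = ndvi(nir=0.5, red=0.1)
pavement_pixel = ndvi(nir=0.25, red=0.25)
print(vegetation_pixel, pavement_pixel)
```

Appending the per-pixel NDVI as an extra feature alongside the spectral bands is one plausible reading of how the index was "added" to the classifier inputs; the abstract does not specify the exact procedure.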

COMPARISON OF SEVERAL CLASSIFICATION METHODS IN PREDICTING PHARMACODYNAMIC INTERACTIONS (PERBANDINGAN BEBERAPA METODE KLASIFIKASI DALAM MEMPREDIKSI INTERAKSI FARMAKODINAMIK)

Indonesian Journal of Statistics and Its Applications ◽

10.29244/ijsa.v4i1.328 ◽

2020 ◽

Vol 4 (1) ◽

pp. 11-21

Author(s):

Hasnita Hasnita ◽

Farit Mochamad Afendi ◽

Anwar Fitrianto

Keyword(s):

Binary Logistic Regression ◽

Process Variable ◽

Support Vector ◽

Classification Methods ◽

Chemical Similarity ◽

Classification Technique ◽

Vector Machines ◽

Scaling Process ◽

AUC Value ◽

Variable Interaction

One mechanism for Drug-Drug Interaction (DDI) is pharmacodynamic (PD) interaction, in which the effects of a drug are changed by other drugs at the receptor site. These interactions can be predicted based on Side Effects Similarity (SES), Chemical Similarity (CS) and Target Protein Connectedness (TPC). This study aims to find the best classification technique after first applying scaling, variable interaction, discretization and resampling. We used Random Forest, Support Vector Machines (SVM) and Binary Logistic Regression for the classification. Of the three classification methods, SVM produced the highest Area Under the Curve (AUC) value, 67.91%.

Big Data to Knowledge: Application of Machine Learning to Predictive Modeling of Therapeutic Response in Cancer.

Current Genomics ◽

10.2174/1389202921999201224110101 ◽

2020 ◽

Vol 21 ◽

Author(s):

Sukanya Panja ◽

Sarra Rahem ◽

Cassandra J. Chu ◽

Antonina Mitrofanova

Keyword(s):

Machine Learning ◽

Missing Values ◽

Therapeutic Response ◽

Patient Data ◽

Machine Learning Techniques ◽

Support Vector ◽

Complex Data ◽

Human Machine Interaction ◽

Data Repositories ◽

Response Modeling

Background: In recent years, the availability of high-throughput technologies, the establishment of large molecular patient data repositories, and advances in computing power and storage have allowed elucidation of complex mechanisms implicated in therapeutic response in cancer patients. The breadth and depth of such data, alongside experimental noise and missing values, require a sophisticated human-machine interaction that would allow effective learning from complex data and accurate forecasting of future outcomes, ideally embedded in the core of machine learning design. Objective: In this review, we discuss machine learning techniques utilized for modeling of treatment response in cancer, including Random Forests, support vector machines, neural networks, and linear and logistic regression. We give an overview of their mathematical foundations and discuss their limitations and alternative approaches, all in light of their application to therapeutic response modeling in cancer. Conclusion: We hypothesize that the increase in the number of patient profiles and potential temporal monitoring of patient data will define even more complex techniques, such as deep learning and causal analysis, as central players in therapeutic response modeling.

Adapting two-class support vector classification methods to many class problems

Proceedings of the 22nd international conference on Machine learning - ICML '05 ◽

10.1145/1102351.1102391 ◽

2005 ◽

Cited By ~ 4

Author(s):

Simon I. Hill ◽

Arnaud Doucet

Keyword(s):

Support Vector ◽

Classification Methods

A novel GPU based intrusion detection system using deep autoencoder with Fruitfly optimization

SN Applied Sciences ◽

10.1007/s42452-021-04579-4 ◽

2021 ◽

Vol 3 (6) ◽

Author(s):

R. Sekhar ◽

K. Sasirekha ◽

P. S. Raja ◽

K. Thangavel

Keyword(s):

Intrusion Detection ◽

Intrusion Detection System ◽

Missing Values ◽

Detection System ◽

Radial Basis Function Network ◽

Detection Methods ◽

Support Vector ◽

Parameter Method ◽

List Type ◽

Fuzzy C Means

Intrusion Detection Systems (IDSs) have received increasing attention as a means of safeguarding the vital information in an organization's network. Generally, hackers easily enter a secured network through loopholes and smart attacks. In such situations, predicting attacks among normal packets is tedious, challenging, time-consuming and highly technical. As a result, different algorithms with varying learning and training capacity have been explored in the literature. However, the existing intrusion detection methods could not meet the desired performance requirements. Hence, this work proposes a new intrusion detection technique using a Deep Autoencoder with Fruitfly Optimization. Initially, missing values in the dataset are imputed with the Fuzzy C-Means Rough Parameter (FCMRP) algorithm, which handles the imprecision in datasets by exploiting fuzzy and rough sets while preserving crucial information. Then, robust features are extracted by an Autoencoder with multiple hidden layers. Finally, the obtained features are fed to a Back Propagation Neural Network (BPN) to classify the attacks. Furthermore, the neurons in the hidden layers of the Deep Autoencoder are optimized with the population-based Fruitfly Optimization algorithm. Experiments have been conducted on the NSL_KDD and UNSW-NB15 datasets. The computational results of the proposed intrusion detection system using a deep autoencoder with BPN are compared with Naive Bayes, Support Vector Machine (SVM), Radial Basis Function Network (RBFN), BPN, and Autoencoder with Softmax.

Article Highlights: A hybridized model using a Deep Autoencoder with Fruitfly Optimization is introduced to classify the attacks. Missing values are imputed with the Fuzzy C-Means Rough Parameter method. The discriminative features are extracted using a Deep Autoencoder with more hidden layers.

Fugl-Meyer hand motor imagination recognition for brain–computer interfaces using only fNIRS

Complex & Intelligent Systems ◽

10.1007/s40747-020-00266-w ◽

2021 ◽

Author(s):

Chenguang Li ◽

Hongjun Yang ◽

Long Cheng

Keyword(s):

Classification Accuracy ◽

Near Infrared ◽

Support Vector ◽

Classification Methods ◽

Experimental Paradigm ◽

Functional Near Infrared Spectroscopy ◽

Machine Method ◽

Physiological Signal ◽

Computer Interfaces ◽

Mode Decomposition

As a relatively new physiological signal of the brain, functional near-infrared spectroscopy (fNIRS) is being used more and more in the brain–computer interface field, especially in motor imagery tasks. However, the classification accuracy based on this signal is relatively low. To improve the accuracy of classification, this paper proposes a new experimental paradigm and uses only fNIRS signals to complete the classification task for six subjects. Notably, the experiment is carried out in a non-laboratory environment, and the movements for motion imagination are properly designed. While the subjects are imagining the motions, they also subvocalize the movements to prevent distraction. Therefore, according to the motor area theory of the cerebral cortex, the positions of the fNIRS probes have been slightly adjusted compared with other methods. Next, the signals are classified by nine classification methods, and the different features and classification methods are compared. The results show that under this new experimental paradigm, classification accuracies of 89.12% and 88.47% can be achieved using the support vector machine method and the random forest method, respectively, which shows that the paradigm is effective. Finally, by selecting the five channels with the largest variance after empirical mode decomposition of the original signal, similar classification results can be achieved.
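
The final channel-selection step described above, keeping the channels whose signals have the largest variance, can be sketched as follows. This covers only the variance ranking; the empirical mode decomposition that precedes it in the paper is not implemented here, and the function name `top_variance_channels`, the channel names, and the sample values are all hypothetical.

```python
from statistics import pvariance


def top_variance_channels(signals, k=5):
    """Return the names of the k channels with the largest variance.

    signals: dict mapping a channel name to its list of samples
             (assumed here to be already preprocessed).
    """
    ranked = sorted(signals,
                    key=lambda name: pvariance(signals[name]),
                    reverse=True)
    return ranked[:k]


# Made-up three-channel recording; keep the two most variable channels.
recording = {
    "ch1": [0.0, 0.0, 0.0, 0.0],    # flat signal, variance 0
    "ch2": [0.0, 1.0, 0.0, 1.0],    # variance 0.25
    "ch3": [0.0, 10.0, 0.0, 10.0],  # variance 25
}
selected = top_variance_channels(recording, k=2)
print(selected)
```

Population variance (`pvariance`) is used here for simplicity; the abstract does not say which variance estimator the authors used.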

New Multi-View Classification Method with Uncertain Data

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3458282 ◽

2021 ◽

Vol 16 (1) ◽

pp. 1-23

Author(s):

Bo Liu ◽

Haowen Zhong ◽

Yanshan Xiao

Keyword(s):

Learning Strategy ◽

State Of The Art ◽

Uncertain Data ◽

Real Life ◽

Support Vector ◽

Classification Methods ◽

Complementary Information ◽

Novel Approach ◽

SVM Model ◽

Iterative Framework

Multi-view classification aims at designing a multi-view learning strategy to train a classifier from multi-view data, which are easily collected in practice. Most existing works on multi-view classification assume that the multi-view data are collected with precise information. However, in real-life applications the collected multi-view data are often uncertain because the collection process is corrupted by noise. To address this, this article proposes a novel approach, called uncertain multi-view learning with support vector machine (UMV-SVM), to cope with the problem of multi-view learning with uncertain data. The method first enforces agreement among all the views to seek complementary information in the multi-view data and takes the uncertainty of the data into consideration by modeling the reachability area of the noise. It then uses an iterative framework to solve the proposed UMV-SVM model and obtain the multi-view classifier for prediction. Extensive experiments on real-life datasets show that the proposed UMV-SVM achieves better performance for uncertain multi-view classification than state-of-the-art multi-view classification methods.
