Supervised Learning for Binary Classification on US Adult Income

2021 ◽  
Vol 13 (2) ◽  
pp. 80-91
Author(s):  
Li-Pang Chen

In this project, various binary classification methods are used to predict US adult income level from social factors including age, gender, education, and marital status. We first explore descriptive statistics for the dataset and deal with missing values. We then examine some widely used classification methods, including logistic regression, discriminant analysis, support vector machines, random forests, and boosting, and provide suitable R functions to demonstrate their application. Various metrics such as ROC curves, accuracy, recall, and F-measure are calculated to compare the performance of these models. We find that boosting is the best method in our data analysis because it attains the highest AUC value and the highest prediction accuracy. In addition, among all predictor variables, we identify the three variables that have the largest impact on US adult income level.
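The comparison metrics named in this abstract can be illustrated with a minimal sketch. The paper itself uses R; the Python below is only an illustration, and the function names and the 0/1 label coding are assumptions, not taken from the paper. AUC is computed here as the fraction of positive/negative pairs ranked correctly, which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1 (assumed coding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision and F-measure (F1) from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs where the positive
    example receives the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise AUC loop is quadratic in the sample size, which is acceptable for a demonstration but not for a dataset the size of the US adult income data; a rank-based computation would be the usual choice there.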

Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2133
Author(s):  
Francisco O. Cortés-Ibañez ◽  
Sunil Belur Nagaraj ◽  
Ludo Cornelissen ◽  
Gerjan J. Navis ◽  
Bert van der Vegt ◽  
...  

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models that could predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) <0.75 was obtained; the highest AUC value was for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machines AUC = 0.83 (95% CI 0.78–0.88)). Age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance of incident cancers in the Lifelines cohort.
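The Gini index mentioned above for variable selection can be sketched as follows. This is a hedged illustration of the standard Gini impurity used by tree-based models, not the study's code; the function names are hypothetical, and the abstract does not say exactly how the index was applied.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity reduction obtained by splitting `parent` into `left` and `right`.

    Summing this quantity over all splits that use a given variable is the
    usual basis for Gini-based variable importance in random forests
    (assumed here to be the kind of importance the study refers to).
    """
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted
```

A variable whose splits separate incident cases from non-cases cleanly yields a large impurity decrease, which is why a strong predictor such as age would rank highly under this kind of measure.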


2021 ◽  
Vol 11 (10) ◽  
pp. 628
Author(s):  
Allan B. I. Bernardo ◽  
Macario O. Cordel ◽  
Rochelle Irene G. Lucas ◽  
Jude Michael M. Teves ◽  
Sashmir A. Yap ◽  
...  

Filipino students ranked last in reading proficiency among all countries/territories in PISA 2018, with only 19% meeting the minimum (Level 2) standard. It is imperative to understand the range of factors that contribute to low reading proficiency, specifically variables that can be the target of interventions to help students with poor reading proficiency. We used machine learning approaches, specifically binary classification methods, to identify the variables that best predict low (Level 1b and lower) vs. higher (Level 1a or better) reading proficiency using the Philippine PISA data from a nationally representative sample of 15-year-old students. Several binary classification methods were applied, and the best classification model was derived using support vector machines (SVM), with 81.2% average test accuracy. The 20 variables with the highest impact in the model were identified and interpreted using a socioecological perspective of development and learning. These variables included students’ home-related resources and socioeconomic constraints, learning motivation and mindsets, classroom reading experiences with teachers, reading self-beliefs, attitudes, and experiences, and social experiences in the school environment. The results were discussed with reference to the need for a systems perspective to address poor proficiency, requiring interconnected interventions that go beyond students’ classroom reading.


Author(s):  
Ivan Jordanov ◽  
Nedyalko Petrov ◽  
Alessio Petrozziello

In this paper we investigate further and extend our previous work on radar signal identification and classification, based on a data set that comprises continuous, discrete, and categorical data representing radar pulse train characteristics such as signal frequencies, pulse repetition, type of modulation, intervals, scan period, scanning type, etc. Like most real-world datasets, it also contains a high percentage of missing values, and to deal with this problem we investigate three imputation techniques: Multiple Imputation (MI), K-Nearest Neighbour Imputation (KNNI), and Bagged Tree Imputation (BTI). We apply these methods to data samples with up to 60% missingness, thereby doubling the number of instances with complete values in the resulting dataset. The performance of the imputation models is assessed with Wilcoxon’s test for statistical significance and Cohen’s effect size metrics. To solve the classification task, we employ three intelligent approaches: Neural Networks (NN), Support Vector Machines (SVM), and Random Forests (RF). Subsequently, we critically analyse which imputation method most influences the classifiers’ performance, using a multiclass classification accuracy metric based on the area under the ROC curves. We consider two superclasses (‘military’ and ‘civil’), each containing several ‘subclasses’, and propose two new metrics, inner class accuracy (IA) and outer class accuracy (OA), in addition to the overall classification accuracy (OCA) metric. We conclude that they can be used as complements to the OCA when choosing the best classifier for the problem at hand.
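As a rough illustration of the K-Nearest Neighbour Imputation idea named in this abstract, the sketch below fills each missing numeric value with the mean of that feature over the k closest rows. It is not the authors' implementation: the distance definition, the use of `None` for missing values, and the restriction to numeric features are all assumptions made for the example.

```python
import math


def _distance(a, b):
    """Root-mean-square difference over features observed in both rows.

    Returns infinity when the two rows share no observed feature.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of `rows` with each None replaced by the mean of that
    feature over the k nearest rows in which the feature is observed."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row with feature j observed.
            donors = sorted(
                ((_distance(row, other), other[j])
                 for idx, other in enumerate(rows)
                 if idx != i and other[j] is not None),
                key=lambda pair: pair[0],
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed
```

Distances are always computed on the original rows, so values imputed earlier in the loop never serve as evidence for later imputations. Categorical radar attributes such as modulation type would need a different distance and a mode-based fill.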


2021 ◽  
Vol 15 (1) ◽  
pp. 66-75
Author(s):  
Loránd Szabó ◽  
Dávid Abriha ◽  
Kwanele Phinzi ◽  
Szilárd Szabó

In this study, two sources of high-resolution satellite imagery, PlanetScope and SkySat, were compared based on their capability to classify urban vegetation. We applied Random Forest and Support Vector Machine classification methods to a study area in the center of Rome, Italy. We first performed the classifications based on the spectral bands alone and then added the NDVI index. We evaluated the performance of the classifiers on the different sets of input data with ROC curves and AUC values. Additional statistical analyses were applied to reveal the correlation structure of the satellite bands and the NDVI, and General Linear Modeling was used to evaluate the AUC of the different models. Although the different classification methods did not produce significantly different outcomes (AUC values between 0.96 and 0.99), SVM performed better. Including NDVI resulted in significantly higher AUC values. SkySat’s bands provided slightly better input data relative to PlanetScope, but the difference was minimal (~3%); accordingly, both satellites ensured excellent classification results.
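The NDVI added as an extra input feature is the standard normalized difference of the near-infrared and red bands, NDVI = (NIR - Red) / (NIR + Red). A minimal sketch follows; the helper names and the band positions passed to `append_ndvi` are hypothetical and do not reflect the actual PlanetScope or SkySat band order.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    Returns 0.0 when both reflectances are zero to avoid division by zero
    (a convention chosen for this sketch).
    """
    denom = nir + red
    if denom == 0:
        return 0.0
    return (nir - red) / denom


def append_ndvi(pixel_bands, nir_index, red_index):
    """Return the pixel's band values with NDVI appended as an extra feature.

    `nir_index` and `red_index` are positions in `pixel_bands`; they are
    illustrative parameters, not a sensor specification.
    """
    value = ndvi(pixel_bands[nir_index], pixel_bands[red_index])
    return list(pixel_bands) + [value]
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so vegetated pixels have NDVI values well above those of built-up surfaces, which is consistent with the reported gain in AUC when NDVI is included.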


2020 ◽  
Vol 4 (1) ◽  
pp. 11-21
Author(s):  
Hasnita Hasnita ◽  
Farit Mochamad Afendi ◽  
Anwar Fitrianto

One mechanism for Drug-Drug Interaction (DDI) is pharmacodynamic (PD) interaction, in which the effects of a drug are changed by other drugs at the receptor site. Such interactions can be predicted based on Side Effects Similarity (SES), Chemical Similarity (CS), and Target Protein Connectedness (TPC). This study aims to find the best classification technique after first applying scaling, variable interaction, discretization, and resampling techniques. We used Random Forest, Support Vector Machines (SVM), and Binary Logistic Regression for the classification. Of the three classification methods, SVM produced the highest Area Under the Curve (AUC) value, 67.91%.
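Two of the preprocessing steps listed above, scaling and resampling, can be sketched as follows. The abstract does not say which scaling or resampling variants were used, so min-max scaling and random oversampling of the minority class are assumptions chosen purely for illustration, as are the function names.

```python
import random


def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, labels, seed=0):
    """Balance classes by resampling smaller classes with replacement
    until every class has as many rows as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

Resampling should be applied to the training split only; oversampling before the train/test split would leak duplicated rows into the test set and inflate the measured AUC.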


2020 ◽  
Vol 21 ◽  
Author(s):  
Sukanya Panja ◽  
Sarra Rahem ◽  
Cassandra J. Chu ◽  
Antonina Mitrofanova

Background: In recent years, the availability of high-throughput technologies, the establishment of large molecular patient data repositories, and advances in computing power and storage have allowed elucidation of complex mechanisms implicated in therapeutic response in cancer patients. The breadth and depth of such data, alongside experimental noise and missing values, require a sophisticated human-machine interaction that would allow effective learning from complex data and accurate forecasting of future outcomes, ideally embedded in the core of machine learning design. Objective: In this review, we discuss machine learning techniques used for modeling treatment response in cancer, including Random Forests, support vector machines, neural networks, and linear and logistic regression. We review their mathematical foundations and discuss their limitations and alternative approaches, all in light of their application to therapeutic response modeling in cancer. Conclusion: We hypothesize that the increase in the number of patient profiles and potential temporal monitoring of patient data will establish even more complex techniques, such as deep learning and causal analysis, as central players in therapeutic response modeling.


2021 ◽  
Vol 3 (6) ◽  
Author(s):  
R. Sekhar ◽  
K. Sasirekha ◽  
P. S. Raja ◽  
K. Thangavel

Intrusion Detection Systems (IDSs) have received increasing attention as a means of safeguarding vital information in an organization's network. Hackers can often enter a secured network through loopholes and smart attacks. In such situations, distinguishing attacks from normal packets is tedious, challenging, time-consuming, and highly technical. As a result, different algorithms with varying learning and training capacity have been explored in the literature. However, the existing intrusion detection methods could not meet the desired performance requirements. Hence, this work proposes a new intrusion detection technique using a Deep Autoencoder with Fruitfly Optimization. Initially, missing values in the dataset are imputed with the Fuzzy C-Means Rough Parameter (FCMRP) algorithm, which handles the imprecision in datasets by exploiting fuzzy and rough sets while preserving crucial information. Then, robust features are extracted from an Autoencoder with multiple hidden layers. Finally, the obtained features are fed to a Back Propagation Neural Network (BPN) to classify the attacks. Furthermore, the neurons in the hidden layers of the Deep Autoencoder are optimized with the population-based Fruitfly Optimization algorithm. Experiments have been conducted on the NSL_KDD and UNSW-NB15 datasets. The computational results of the proposed intrusion detection system using a deep autoencoder with BPN are compared with Naive Bayes, Support Vector Machine (SVM), Radial Basis Function Network (RBFN), BPN, and Autoencoder with Softmax.

Article Highlights: A hybridized model using a Deep Autoencoder with Fruitfly Optimization is introduced to classify the attacks. Missing values have been imputed with the Fuzzy C-Means Rough Parameter method. The discriminative features are extracted using a Deep Autoencoder with more hidden layers.


Author(s):  
Chenguang Li ◽  
Hongjun Yang ◽  
Long Cheng

As a relatively new physiological signal of the brain, functional near-infrared spectroscopy (fNIRS) is increasingly used in the brain–computer interface field, especially for motor imagery tasks. However, the classification accuracy based on this signal is relatively low. To improve the classification accuracy, this paper proposes a new experimental paradigm and uses only fNIRS signals to complete the classification task for six subjects. Notably, the experiment is carried out in a non-laboratory environment, and the imagined movements are carefully designed. While the subjects imagine the motions, they also subvocalize the movements to prevent distraction. In addition, according to the motor area theory of the cerebral cortex, the positions of the fNIRS probes have been slightly adjusted compared with other methods. Next, the signals are classified by nine classification methods, and the different features and classification methods are compared. The results show that under this new experimental paradigm, classification accuracies of 89.12% and 88.47% can be achieved using the support vector machine method and the random forest method, respectively, which shows that the paradigm is effective. Finally, by selecting the five channels with the largest variance after empirical mode decomposition of the original signal, similar classification results can be achieved.
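The final channel-selection step can be illustrated with a short sketch that ranks channels by signal variance and keeps the top five. It deliberately omits the empirical mode decomposition stage described in the abstract and operates on whatever samples it is given; the function name and the dictionary layout (channel name mapped to a list of samples) are assumptions.

```python
from statistics import pvariance


def top_variance_channels(signals, k=5):
    """Return the names of the k channels with the largest variance.

    `signals` maps a channel name to its list of samples. In the paper the
    variance is taken after empirical mode decomposition, which this
    sketch does not perform.
    """
    variances = {name: pvariance(samples) for name, samples in signals.items()}
    return sorted(variances, key=variances.get, reverse=True)[:k]
```

Keeping only high-variance channels reduces the feature dimension passed to the classifiers, which matches the abstract's observation that five channels suffice for similar accuracy.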


2021 ◽  
Vol 16 (1) ◽  
pp. 1-23
Author(s):  
Bo Liu ◽  
Haowen Zhong ◽  
Yanshan Xiao

Multi-view classification aims at designing a multi-view learning strategy to train a classifier from multi-view data, which are easily collected in practice. Most existing works on multi-view classification assume that the multi-view data are collected with precise information. However, in real-life applications the collected multi-view data are often uncertain because the collection process is corrupted by noise. To address this, this article proposes a novel approach, called uncertain multi-view learning with support vector machine (UMV-SVM), to cope with the problem of multi-view learning with uncertain data. The method first enforces agreement among all the views to seek complementary information in the multi-view data, and takes the uncertainty of the data into consideration by modeling the reachability area of the noise. It then uses an iterative framework to solve the proposed UMV-SVM model and obtain the multi-view classifier for prediction. Extensive experiments on real-life datasets show that the proposed UMV-SVM achieves better performance for uncertain multi-view classification than state-of-the-art multi-view classification methods.

