Comparison of Supervised Classification Models on Textual Data

Text classification is essential to many applications, such as spam detection and sentiment analysis. With the growing number of textual documents and datasets generated through social media and news articles, an increasing number of machine learning methods are required for accurate textual classification. For this paper, a comprehensive evaluation of the performance of multiple supervised learning models, such as logistic regression (LR), decision trees (DT), support vector machine (SVM), AdaBoost (AB), random forest (RF), multinomial naive Bayes (NB), multilayer perceptrons (MLP), and gradient boosting (GB), was conducted to assess the efficiency and robustness, as well as the limitations, of these models on the classification of textual data. SVM, LR, and MLP generally performed better, with SVM being the best, while DT and AB had much lower accuracies than the other tested models. Further exploration of different SVM kernels demonstrated the advantage of linear kernels over polynomial, sigmoid, and radial basis function kernels for text classification. The effect of removing stop words on model performance was also investigated; DT performed better with stop words removed, while all other models were relatively unaffected by the presence or absence of stop words.
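The stop-word comparison described in this abstract can be illustrated with a minimal sketch. It trains a small multinomial naive Bayes classifier twice, once with and once without stop-word removal. The toy corpus, the stop-word list, and all function names are hypothetical and are not taken from the paper, which does not state its implementation here.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "is", "this", "to", "and", "it", "of"}

def tokenize(text, remove_stop_words=False):
    """Lowercase, split on whitespace, optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def train_nb(docs, labels, remove_stop_words=False):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text, remove_stop_words)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text, remove_stop_words=False):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text, remove_stop_words):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus for illustration only.
train_docs = [
    "win a free prize now",
    "free money claim it now",
    "the meeting is moved to monday",
    "this is the agenda of the meeting",
]
train_labels = ["spam", "spam", "ham", "ham"]

results = {}
for flag in (False, True):
    model = train_nb(train_docs, train_labels, remove_stop_words=flag)
    prediction = predict_nb(model, "claim the free prize", flag)
    results[flag] = (len(model[2]), prediction)
    print(f"stop words removed={flag}: vocabulary={len(model[2])}, prediction={prediction}")
```

On this toy data the vocabulary shrinks when stop words are removed while the prediction stays the same, which is the kind of insensitivity the abstract reports for most models. A four-document corpus cannot, of course, support any conclusion about real datasets.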


The impact of different parameter sets on the classification of asteroid types

10.5194/epsc2021-807 ◽

2021 ◽

Author(s):

Hanna Klimczak ◽

Wojciech Kotłowski ◽

Dagmara Oszkiewicz ◽

Francesca DeMeo ◽

Agnieszka Kryszczyńska ◽

...

Keyword(s):

Gradient Boosting ◽

Support Vector ◽

Multilayer Perceptrons ◽

Machine Learning Methods ◽

Vector Machines ◽

Science Centre ◽

The Difference ◽

The Impact

The aim of the project is the classification of asteroids according to the most commonly used asteroid taxonomy (Bus-DeMeo, DeMeo et al. 2009) with the use of various machine learning methods such as Logistic Regression, Naive Bayes, Support Vector Machines, Gradient Boosting and Multilayer Perceptrons. Different parameter sets are used for classification in order to compare the quality of prediction with a limited amount of data, namely the difference in performance between using the 0.45 μm to 2.45 μm spectral range and multiple spectral features, as well as performing Principal Component Analysis to reduce the dimensions of the spectral data. This work has been supported by grant No. 2017/25/B/ST9/00740 from the National Science Centre, Poland.
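As a rough illustration of the dimensionality reduction step mentioned above, the following sketch computes the first principal component of a few spectra with power iteration and projects each spectrum onto it. The reflectance values are hypothetical, and the abstract does not say how many components the authors kept or which implementation they used.

```python
import math

def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: approximates the dominant eigenvector of cov."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Score of each row along the given component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]

# Hypothetical reflectance values at four wavelengths (one row per asteroid).
spectra = [
    [1.00, 1.05, 1.10, 1.15],
    [1.00, 1.10, 1.20, 1.30],
    [1.00, 1.15, 1.30, 1.45],
    [1.00, 1.20, 1.40, 1.60],
]

centered = mean_center(spectra)
component = first_component(covariance(centered))
scores = project(centered, component)
print(component)
print(scores)
```

Here the four-value spectra are reduced to a single score each, which in this toy case tracks the spectral slope. A classifier would then be trained on such scores instead of on the full spectrum.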


EEG-Based Classification of the Driver Alertness State

Current Directions in Biomedical Engineering ◽

10.1515/cdbme-2020-3091 ◽

2020 ◽

Vol 6 (3) ◽

pp. 353-356

Author(s):

Martin Golz ◽

Sebastian Thomas ◽

Adolf Schenka

Keyword(s):

Machine Learning ◽

Gradient Boosting ◽

Support Vector ◽

Weighting Matrix ◽

Machine Learning Methods ◽

Young Drivers ◽

Eeg Data ◽

Vector Machines ◽

Generalized Matrix

Abstract GMLVQ (Generalized Matrix Relevance Learning Vector Quantization) is a method of machine learning with an adaptive metric. During training, the prototype vectors as well as the weight matrix of the metric are adapted simultaneously. The method is presented in more detail and compared with other machine learning methods employing a fixed metric. It was investigated how accurately the methods can assign the 6-channel EEG of 25 young drivers, who drove overnight in the simulation lab, to the two classes of mild and severe drowsiness. Results of cross-validation show that GMLVQ achieves a mean classification accuracy of 81.7 ± 1.3 %. It is not as accurate as support-vector machines (SVM) and gradient boosting machines (GBM) and cannot exploit the potential of learning adaptive metrics in the case of EEG data. However, the weighting matrix provides information on the relevance of each signal feature.
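The adaptive metric can be sketched as follows, assuming the usual GMLVQ formulation in which the distance between a sample x and a prototype w is (x - w)^T Λ (x - w) with Λ = Ω^T Ω. The matrix Ω is fixed by hand here to show its effect; in GMLVQ it is learned together with the prototypes, and that training step is omitted. All numbers and names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gmlvq_distance(x, prototype, omega):
    """d(x, w) = ||Omega (x - w)||^2 = (x - w)^T Omega^T Omega (x - w)."""
    diff = [a - b for a, b in zip(x, prototype)]
    return sum(p * p for p in matvec(omega, diff))

def relevance_diagonal(omega):
    """Diagonal of Lambda = Omega^T Omega: one relevance value per feature."""
    n_features = len(omega[0])
    return [sum(row[j] ** 2 for row in omega) for j in range(n_features)]

def classify(x, prototypes, labels, omega):
    """Assign the label of the closest prototype under the given metric."""
    distances = [gmlvq_distance(x, w, omega) for w in prototypes]
    return labels[distances.index(min(distances))]

# Hypothetical two-feature example with one prototype per class.
prototypes = [[0.0, 0.0], [1.0, 5.0]]
labels = ["mild", "severe"]
x = [0.9, 0.0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # fixed Euclidean metric
omega = [[1.0, 0.0], [0.0, 0.1]]      # metric that down-weights feature 2

print(classify(x, prototypes, labels, identity))  # Euclidean decision
print(classify(x, prototypes, labels, omega))     # decision under the adapted metric
print(relevance_diagonal(omega))
```

The two calls disagree because the second metric treats the second feature as nearly irrelevant. The diagonal of Λ is the per-feature relevance information that the abstract refers to.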


Identification of Predictor Genes for Feed Efficiency in Beef Cattle by Applying Machine Learning Methods to Multi-Tissue Transcriptome Data

Frontiers in Genetics ◽

10.3389/fgene.2021.619857 ◽

2021 ◽

Vol 12 ◽

Author(s):

Weihao Chen ◽

Pâmela A. Alexandre ◽

Gabriela Ribeiro ◽

Heidge Fukumasu ◽

Wei Sun ◽

...

Keyword(s):

Machine Learning ◽

Feed Efficiency ◽

Gradient Boosting ◽

Support Vector ◽

Sequencing Data ◽

Machine Learning Methods ◽

Extreme Gradient Boosting ◽

High Feed ◽

Differential Gene

Machine learning (ML) methods have shown promising results in identifying genes when applied to large transcriptome datasets. However, no attempt has been made to compare the performance of combining different ML methods in the prediction of high feed efficiency (HFE) and low feed efficiency (LFE) animals. In this study, using RNA sequencing data of five tissues (adrenal gland, hypothalamus, liver, skeletal muscle, and pituitary) from nine HFE and nine LFE Nellore bulls, we evaluated the prediction accuracies of five analytical methods in classifying FE animals. These included two conventional methods for differential gene expression (DGE) analysis (t-test and edgeR) as benchmarks, and three ML methods: Random Forests (RFs), Extreme Gradient Boosting (XGBoost), and a combination of both RF and XGBoost (RX). The utility of a subset of candidate genes selected by each method for classification of FE animals was assessed by support vector machine (SVM). Among all methods, the smallest subset of genes (117), identified by RX, outperformed those chosen by t-test, edgeR, RF, or XGBoost in classification accuracy of animals. Gene co-expression network analysis confirmed the interactivity existing among these genes and their relevance within the network related to their prediction ranking based on ML. The results demonstrate a great potential for applying a combination of ML methods to large transcriptome datasets to identify biologically important genes for accurately classifying FE animals.


MODIS-FIRMS and ground-truthing based wildfire likelihood mapping of Sikkim Himalaya using machine learning algorithms.

10.21203/rs.3.rs-750123/v1 ◽

2021 ◽

Author(s):

Polash Banerjee

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Tree Cover ◽

Anthropogenic Factors ◽

Gradient Boosting ◽

Support Vector ◽

Learning Methods ◽

Sikkim Himalaya ◽

Environmental Features ◽

Machine Learning Methods

Abstract Wildfires of limited extent and intensity can be a boon for the forest ecosystem. However, the 2019 wildfires in Australia and Brazil are sad reminders of their heavy ecological and economic costs. Understanding the role of environmental factors in the likelihood of wildfires in a spatial context would be instrumental in mitigating them. In this study, 14 environmental features encompassing meteorological, topographical, ecological, in situ and anthropogenic factors have been considered for preparing the wildfire likelihood map of Sikkim Himalaya. A comparative study on the efficiency of machine learning methods such as Generalized Linear Model (GLM), Support Vector Machine (SVM), Random Forest (RF) and Gradient Boosting Model (GBM) has been performed to identify the best performing algorithm in wildfire prediction. The study indicates that all the machine learning methods are good at predicting wildfires. However, RF performed best, followed by GBM. Also, environmental features such as average temperature, average wind speed, proximity to roadways and tree cover percentage are the most important determinants of wildfires in Sikkim Himalaya. This study can be considered as a decision support tool for preparedness, efficient resource allocation and sensitization of people towards mitigation of wildfires in Sikkim.


Classification of the fragrant styles and evaluation of the aromatic quality of flue-cured tobacco leaves by machine-learning methods

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720016500335 ◽

2016 ◽

Vol 14 (06) ◽

pp. 1650033 ◽

Cited By ~ 1

Author(s):

Li Gu ◽

Lichun Xue ◽

Qi Song ◽

Fengji Wang ◽

Huaqin He ◽

...

Keyword(s):

Evaluation System ◽

Chemical Compounds ◽

Support Vector ◽

Tobacco Leaves ◽

Machine Learning Methods ◽

Online Tools ◽

Svm Algorithm ◽

Assessment Performance

During commercial transactions, the quality of flue-cured tobacco leaves must be characterized efficiently, and the evaluation system should be easily transferable across different traders. However, there are over 3000 chemical compounds in flue-cured tobacco leaves; thus, it is impossible to evaluate the quality of flue-cured tobacco leaves using all the chemical compounds. In this paper, we used the Support Vector Machine (SVM) algorithm together with 22 chemical compounds selected by ReliefF-Particle Swarm Optimization (R-PSO) to classify the fragrant style of flue-cured tobacco leaves, where the Accuracy (ACC) and Matthews Correlation Coefficient (MCC) were 90.95% and 0.80, respectively. The SVM algorithm combined with 19 chemical compounds selected by R-PSO achieved the best assessment performance of the aromatic quality of tobacco leaves, where the PCC and MSE were 0.594 and 0.263, respectively. Finally, we constructed two online tools to classify the fragrant style and evaluate the aromatic quality of flue-cured tobacco leaf samples. These tools can be accessed at http://bioinformatics.fafu.edu.cn/tobacco.
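The four reported metrics can be computed as in the sketch below. It assumes that PCC denotes the Pearson correlation coefficient and MSE the mean squared error, and it uses the binary form of the MCC; the paper may have used a multi-class variant. The counts and values are hypothetical and do not come from the paper.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mse(y_true, y_pred):
    """Mean squared error between reference and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Hypothetical confusion-matrix counts and quality scores.
acc_value = accuracy(tp=8, tn=6, fp=2, fn=4)
mcc_value = mcc(tp=8, tn=6, fp=2, fn=4)
pcc_value = pearson([1, 2, 3, 4], [2, 4, 6, 8])
mse_value = mse([1, 2, 3, 4], [2, 4, 6, 8])
print(acc_value, mcc_value, pcc_value, mse_value)
```

The last two values show why PCC and MSE are reported together: the predictions are perfectly correlated with the reference scores yet far from them in absolute terms.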


Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm

Journal of Translational Medicine ◽

10.1186/s12967-020-02550-2 ◽

2020 ◽

Vol 18 (1) ◽

Author(s):

Kerry E. Poppenberg ◽

Vincent M. Tutino ◽

Lu Li ◽

Muhammad Waqas ◽

Armond June ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Model Performance ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Training Cohort ◽

Network Analyses ◽

Machine Learning Methods

Abstract Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.
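The area under the ROC curve used to assess the models can be computed directly from predicted scores. The sketch below relies on the standard equivalence between AUC and the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Hypothetical test cohort: 1 = IA present, 0 = IA-free control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

Because only the ranking of the scores matters, AUC can be compared across the four model types even when their raw outputs are on different scales.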


Ooredoo Rayek

International Journal of Technology Diffusion ◽

10.4018/ijtd.2020040105 ◽

2020 ◽

Vol 11 (2) ◽

pp. 66-81

Author(s):

Badia Klouche ◽

Sidi Mohamed Benslimane ◽

Sakina Rim Bennabi

Keyword(s):

Social Media ◽

Support Vector Machine ◽

Text Mining ◽

Sentiment Analysis ◽

Experimental Results ◽

Support Vector ◽

Textual Data ◽

New Strategy ◽

Set Up

Sentiment analysis is an emerging research area concerned with the classification of sentiment polarity and with text mining, particularly given the considerable number of opinions available on social media. The Algerian telephone operator Ooredoo, like other operators, aims in its new strategy to win new customers by exploiting their opinions through sentiment analysis. The purpose of this work is to set up a system called “Ooredoo Rayek”, whose objective is to collect, transliterate, translate and classify the textual data expressed by the Ooredoo operator's customers. This article developed a set of rules allowing the transliteration from Algerian Arabizi to Algerian dialect. Furthermore, the authors used Naïve Bayes (NB) and Support Vector Machine (SVM) classifiers to assign polarity tags to Facebook comments from the official pages of Ooredoo written in a multilingual and multi-dialect context. Experimental results show that the system performs well, with an accuracy of 83%.


Hierarchical attention networks for information extraction from cancer pathology reports

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocx131 ◽

2017 ◽

Vol 25 (3) ◽

pp. 321-330 ◽

Cited By ~ 29

Author(s):

Shang Gao ◽

Michael T Young ◽

John X Qiu ◽

Hong-Jun Yoon ◽

James B Christian ◽

...

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Information Extraction ◽

Model Performance ◽

Gradient Boosting ◽

Support Vector ◽

Attention Networks ◽

Cancer Pathology ◽

Extreme Gradient Boosting ◽

Pathology Reports

Abstract Objective We explored how a deep learning (DL) approach based on hierarchical attention networks (HANs) can improve model performance for multiple information extraction tasks from unstructured cancer pathology reports compared to conventional methods that do not sufficiently capture syntactic and semantic contexts from free-text documents. Materials and Methods Data for our analyses were obtained from 942 deidentified pathology reports collected by the National Cancer Institute Surveillance, Epidemiology, and End Results program. The HAN was implemented for 2 information extraction tasks: (1) primary site, matched to 12 International Classification of Diseases for Oncology topography codes (7 breast, 5 lung primary sites), and (2) histological grade classification, matched to G1–G4. Model performance metrics were compared to conventional machine learning (ML) approaches including naive Bayes, logistic regression, support vector machine, random forest, and extreme gradient boosting, and other DL models, including a recurrent neural network (RNN), a recurrent neural network with attention (RNN w/A), and a convolutional neural network. Results Our results demonstrate that for both information tasks, HAN performed significantly better compared to the conventional ML and DL techniques. In particular, across the 2 tasks, the mean micro and macro F-scores for the HAN with pretraining were (0.852, 0.708), compared to naive Bayes (0.518, 0.213), logistic regression (0.682, 0.453), support vector machine (0.634, 0.434), random forest (0.698, 0.508), extreme gradient boosting (0.696, 0.522), RNN (0.505, 0.301), RNN w/A (0.637, 0.471), and convolutional neural network (0.714, 0.460). Conclusions HAN-based DL models show promise in information abstraction tasks within unstructured clinical pathology reports.
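The gap between the micro and macro F-scores reported above is typical of imbalanced label sets. The following sketch computes both averages for a single-label multi-class task. The grade labels and predictions are hypothetical, and the function names are chosen for illustration only.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# Hypothetical histological grades: G1 is frequent, G3 is rare.
y_true = ["G1", "G1", "G1", "G1", "G2", "G2", "G3"]
y_pred = ["G1", "G1", "G1", "G1", "G2", "G1", "G1"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

The micro average pools all decisions, so it is dominated by the frequent class, while the macro average weights every class equally and is pulled down by the rare class that is never predicted.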


Enhanced Changeover Detection in Industry 4.0 Environments with Machine Learning

Sensors ◽

10.3390/s21175896 ◽

2021 ◽

Vol 21 (17) ◽

pp. 5896

Author(s):

Eddi Miller ◽

Vladyslav Borysenko ◽

Moritz Heusinger ◽

Niklas Niedner ◽

Bastian Engelmann ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Binary Classification ◽

Model Performance ◽

Support Vector ◽

Milling Machine ◽

Vector Machines ◽

Changeover Times ◽

Flow Power

Changeover times are an important element when evaluating the Overall Equipment Effectiveness (OEE) of a production machine. The article presents a machine learning (ML) approach that is based on an external sensor setup to automatically detect changeovers in a shopfloor environment. The door statuses, coolant flow, power consumption, and operator indoor GPS data of a milling machine were used in the ML approach. Decision Trees, Support Vector Machines, (Balanced) Random Forest algorithms, and Neural Networks were chosen as ML methods, and their performance was compared. The best results were achieved with the Random Forest ML model (97% F1 score, 99.72% AUC score). It was also found that model performance is optimal when only a binary classification of a changeover phase and a production phase is considered and fewer subphases of the changeover process are applied.


Studi Komparasi Metode Machine Learning untuk Klasifikasi Citra Huruf Vokal Hiragana

JURNAL MEDIA INFORMATIKA BUDIDARMA ◽

10.30865/mib.v5i3.3083 ◽

2021 ◽

Vol 5 (3) ◽

pp. 905

Author(s):

Muhammad Afrizal Amrustian ◽

Vika Febri Muliati ◽

Elsa Elvira Awal

Keyword(s):

Machine Learning ◽

Comparative Study ◽

Image Classification ◽

Nearest Neighbor ◽

Support Vector ◽

K Nearest Neighbor ◽

Learning Methods ◽

Machine Learning Methods ◽

The Comparative Study

Japanese is one of the most difficult languages to understand and read. One reason for this difficulty is that Japanese writing does not use the alphabet. There are three types of Japanese script, namely kanji, katakana, and hiragana. Hiragana is the most commonly used type of writing. In addition, hiragana has a cursive nature, so each person's handwriting will be different. Machine learning methods can be used to read Japanese letters by recognizing images of the letters. The Japanese letters used in this study are the hiragana vowels. This study conducts a comparative study of machine learning methods for the image classification of Japanese letters. The machine learning methods compared are Naïve Bayes, Support Vector Machine, Decision Tree, Random Forest, and K-Nearest Neighbor. The results show that the K-Nearest Neighbor method is the best method for image classification of hiragana vowels, achieving an accuracy of 89.4% with a low error rate.
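A minimal sketch of K-Nearest Neighbor classification, the best-performing method in this study, is given below. It assumes Euclidean distance and k = 3, neither of which is stated in the abstract. The two-value feature vectors stand in for flattened letter images and are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features for two hiragana vowels ("a" and "i").
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["a", "a", "a", "i", "i", "i"]

print(knn_predict(train_x, train_y, (0.5, 0.5)))
print(knn_predict(train_x, train_y, (5.5, 5.5)))
```

KNN has no training phase beyond storing the samples, so handwriting variation is handled only to the extent that similar examples exist in the training set.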
