Machine Learning Algorithms for Classification of Gas Sensor Array Dataset

Predictive machine learning models were used to measure the accuracy of the sensed data. These models take datasets as input and predict the output from them. A large dataset allows better and more efficient predictive models to be designed, because more data is available for training, but it also leads to a dimensionality problem. This problem is addressed with Principal Component Analysis (PCA), a dimensionality reduction algorithm. PCA reduces the redundant or correlated data present in the dataset, thereby reducing its dimensionality. Classifier algorithms such as K-Nearest Neighbour (KNN), Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM) are used, and their output is given in the form of a confusion matrix. The prediction accuracy of each model is determined from this confusion matrix. The accuracy measurements show that the SVM model is the most accurate (94%) in predicting the output, whereas the NB model is the least accurate (60%).
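As a minimal sketch of how prediction accuracy can be read off a confusion matrix, the snippet below builds a matrix from true and predicted labels and divides the diagonal by the total. The class names and labels are hypothetical illustration data, not values from the study, and the function names are assumptions of this sketch.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where row i is the true class and column j the predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Overall accuracy: correctly classified samples (the diagonal) over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels for illustration only, not data from the study.
labels = ["gas_a", "gas_b", "gas_c"]
y_true = ["gas_a", "gas_a", "gas_b", "gas_b", "gas_c", "gas_c"]
y_pred = ["gas_a", "gas_b", "gas_b", "gas_b", "gas_c", "gas_a"]

cm = confusion_matrix(y_true, y_pred, labels)
print(cm)
print(accuracy(cm))
```

Comparing this single number across classifiers is presumably how the reported 94% and 60% figures were obtained, although the abstract does not give the underlying matrices.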


Extracted features based multi-class classification of orthodontic images

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i4.pp3558-3567 ◽

2020 ◽

Vol 10 (4) ◽

pp. 3558

Author(s):

Hicham Riri ◽

Mohammed Ed-Dhahraouy ◽

Abdelmajid Elmoutaouakkil ◽

Abderrahim Beni-Hssane ◽

Farid Bourzgui

Keyword(s):

Machine Learning ◽

Local Binary Pattern ◽

Principal Component ◽

Machine Learning Algorithms ◽

Support Vector ◽

Linear Discriminant ◽

Nearest Neighbours ◽

Multi Class Classification ◽

Pca Algorithm

The purpose of this study is to investigate computer vision and machine learning methods for the classification of orthodontic images, in order to provide orthodontists with a solution for multi-class classification of patients’ images to evaluate the evolution of their treatment. We propose three algorithms based on extracted features, such as facial features and skin colour in the YCbCr colour space, assigned to nodes of a decision tree to classify orthodontic images: an algorithm for intra-oral images, an algorithm for mould images, and an algorithm for extra-oral images. We then compared our method against the Local Binary Pattern (LBP) algorithm, which we implemented to extract textural features from images. After that, we applied the principal component analysis (PCA) algorithm to reduce the redundant parameters in order to classify LBP features with six classifiers: Quadratic Support Vector Machine (SVM), Cubic SVM, Radial Basis Function SVM, Cosine K-Nearest Neighbours (KNN), Euclidean KNN, and Linear Discriminant Analysis (LDA). The presented algorithms were evaluated on a dataset of images of 98 different patients, and experimental results demonstrate the good performance of our proposed method, with high accuracy compared with the machine learning algorithms; among these, the LDA classifier achieves an accuracy of 84.5%.
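To make the LBP texture feature concrete, here is a minimal sketch of the basic 8-neighbour variant on a grayscale image stored as nested lists. The neighbour ordering, the `>=` threshold, and the function names are assumptions of this sketch; the abstract does not say which LBP variant or parameters the authors used.

```python
def lbp_code(patch):
    """Basic 8-neighbour LBP code for the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbours visited clockwise from the top-left corner (an assumed ordering).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of the image."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist

# Hypothetical 4x4 grayscale image, for illustration only.
image = [
    [6, 5, 2, 3],
    [7, 6, 1, 4],
    [9, 8, 7, 5],
    [1, 2, 3, 4],
]
hist = lbp_histogram(image)
print(lbp_code([row[0:3] for row in image[0:3]]))
```

The resulting histogram is the kind of feature vector that could then be passed through PCA and into a classifier.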


Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data of high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) were evaluated in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGBtree and XGBdart) and linear (XGBlin) classifiers. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (nRMSE of 18.4% and bias of 0.046 m) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (nRMSE of 15.8% and bias of −0.244 m) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection: it reduced the data by 95%, selecting the 29 most important variables from the initial 516 variables derived from lidar metrics and hyperspectral data.
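The reported metrics can be sketched as follows. This is an illustration under stated assumptions: bias is taken as the mean of predicted minus observed, and nRMSE is normalised by the mean observed height, since the abstract does not state which normaliser was used. The height values are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return R^2, RMSE, nRMSE (percent of the observed mean) and bias."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": rmse,
        # Assumption: normalise by the observed mean; other conventions use the range.
        "nrmse_pct": 100 * rmse / mean_obs,
        # Assumption: positive bias means over-prediction.
        "bias": sum(residuals) / n,
    }

# Hypothetical forest heights in metres, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
metrics = regression_metrics(observed, predicted)
print(metrics)

# Share of variables dropped when 29 of the 516 reported variables are kept.
reduction_pct = 100 * (1 - 29 / 516)
print(round(reduction_pct, 1))
```

The last calculation uses only the counts given in the abstract and comes out close to the rounded 95% reduction it reports.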


Predicting Forest Fires using Supervised and Ensemble Machine Learning Algorithms

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2878.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 3697-3705 ◽

Cited By ~ 1

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Forest Fires ◽

Principal Component ◽

Climatic Conditions ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Physical Factors

Forest fires have become one of the most frequently occurring disasters in recent years. They have a lasting impact on the environment, as they lead to deforestation and global warming, which is itself one of the major causes of their occurrence. Forest fires are dealt with by collecting satellite images of forests, and if the fires cause an emergency, the authorities are notified to mitigate their effects. By the time the authorities get to know about it, the fires would have already caused a lot of damage. Data mining and machine learning techniques can provide an efficient prevention approach in which data associated with forests is used to predict the eventuality of forest fires. This paper uses the dataset from the UCI machine learning repository that consists of physical factors and climatic conditions of the Montesinho park in Portugal. Various algorithms such as Logistic Regression, Support Vector Machine, Random Forest, and K-Nearest Neighbors, in addition to Bagging and Boosting predictors, are used both with and without Principal Component Analysis (PCA). Among the models to which PCA was applied, Logistic Regression gave the highest F1 score of 68.26, and among the models without PCA, Gradient Boosting gave the highest score of 68.36.
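For reference, the F1 score used to compare these models is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts. The counts are hypothetical, and the abstract's values (68.26 and 68.36) are presumably the same quantity expressed as a percentage.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a fire / no-fire classifier, for illustration only.
print(f1_score(tp=30, fp=10, fn=20))
```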


Comparative Analysis of Machine Learning Algorithms for Computer-Assisted Reporting Based on Fully Automated Cross-Lingual RadLex® Mappings

10.20944/preprints202004.0354.v1 ◽

2020 ◽

Author(s):

Máté E. Maros ◽

Chang Gyu Cho ◽

Andreas G. Junge ◽

Benedikt Kämpgen ◽

Victor Saase ◽

...

Keyword(s):

Machine Learning ◽

Language Processing ◽

Confusion Matrix ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Imaging Biomarkers ◽

Brier Score ◽

Support Vector ◽

Computer Assisted ◽

Cross Lingual

Objectives: Studies evaluating machine learning (ML) algorithms on cross-lingual RadLex® mappings for developing context-sensitive radiological reporting tools are lacking. Therefore, we investigated whether ML-based approaches can be utilized to assist radiologists in providing key imaging biomarkers, such as the Alberta Stroke Programme Early CT Score (ASPECTS). Material and Methods: A stratified random sample (age, gender, year) of CT reports (n=206) with suspected ischemic stroke was generated out of 3997 reports signed off between 2015 and 2019. Three independent, blinded readers assessed these reports and manually annotated clinico-radiologically relevant key features. The primary outcome was whether ASPECTS should have been provided (yes/no: 154/52). For all reports, both the findings and impressions underwent cross-lingual (German to English) RadLex® mappings using natural language processing. Well-established ML algorithms including classification trees, random forests, elastic net, support vector machines (SVMs) and boosted trees were evaluated in a 5 × 5-fold nested cross-validation framework. Further, a linear classifier (fastText) was directly fitted on the German reports. Ensemble learning was used to provide robust importance rankings of these ML algorithms. Performance was evaluated using derivatives of the confusion matrix and metrics of calibration including AUC, Brier score and log loss, as well as visually by calibration plots. Results: On this imbalanced classification task, SVMs showed the highest accuracies both on human-extracted (87%) and fully automated RadLex® features (findings: 82.5%; impressions: 85.4%). FastText without a pre-trained language model showed the highest accuracy (89.3%) and AUC (92%) on the impressions. The ensemble learner revealed that boosted trees, fastText and SVMs are the most important ML classifiers. Boosted trees fitted on the findings showed the best overall calibration curve. Conclusions: Contextual ML-based assistance suggesting ASPECTS while reporting neuroradiological emergencies is feasible, even if ML models are restricted to being developed on limited and highly imbalanced data sets.
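Two of the calibration metrics named above can be sketched in a few lines. The probabilities below are hypothetical; only the 154/52 class split comes from the abstract, and it is used here to show the accuracy a trivial "always yes" classifier would reach on this imbalanced task.

```python
import math

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the outcomes; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

# Hypothetical outcomes and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.6, 0.8]
print(brier_score(y_true, p_pred))
print(log_loss(y_true, p_pred))

# Majority-class baseline from the reported yes/no split of 154/52.
baseline_accuracy = 154 / (154 + 52)
print(round(baseline_accuracy, 3))
```

Lower is better for both scores. The baseline of roughly 75% is a useful reference point when reading the reported accuracies of 82.5% to 89.3%.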


Classification model for accuracy and intrusion detection using machine learning approach

PeerJ Computer Science ◽

10.7717/peerj-cs.437 ◽

2021 ◽

Vol 7 ◽

pp. e437

Author(s):

Arushi Agarwal ◽

Purushottam Sharma ◽

Mohammed Alshehri ◽

Ahmed A. Mohamed ◽

Osama Alfarraj

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

Nearest Neighbor ◽

Performance Metrics ◽

Detection System ◽

Confusion Matrix ◽

Machine Learning Algorithms ◽

Classification Model ◽

Support Vector ◽

K Nearest Neighbor

In today’s cyber world, the demand for the internet is increasing day by day, raising concerns about network security. The aim of an Intrusion Detection System (IDS) is to provide approaches against many fast-growing network attacks (e.g., DDoS, ransomware, and botnet attacks), as it blocks the harmful activities occurring in the network system. In this work, three classification machine learning algorithms, Naïve Bayes (NB), Support Vector Machine (SVM), and K-nearest neighbor (KNN), were used on the UNSW-NB15 dataset to assess detection accuracy and reduce processing time, and to find the best-suited algorithm that can efficiently learn the pattern of suspicious network activities. The data gathered from the feature set comparison was then fed to the IDS to train the system for future intrusion behavior prediction and analysis, using the best-fit algorithm chosen from these three algorithms based on the performance metrics found. In addition, classification reports (Precision, Recall, and F1-score) and confusion matrices were generated and compared to finalize the support-validation status found throughout the testing phase of the model used in this approach.


Evaluation of Different Machine Learning Algorithms for Scalable Classification of Tree Types and Tree Species Based on Sentinel-2 Data

Remote Sensing ◽

10.3390/rs10091419 ◽

2018 ◽

Vol 10 (9) ◽

pp. 1419 ◽

Cited By ~ 31

Author(s):

Mathias Wessel ◽

Melanie Brandmeier ◽

Dirk Tiede

Keyword(s):

Machine Learning ◽

Tree Species ◽

Confusion Matrix ◽

Machine Learning Algorithms ◽

Support Vector ◽

Inventory Data ◽

Oak Trees ◽

Object Based ◽

Sentinel 2

We use freely available Sentinel-2 data and forest inventory data to evaluate the potential of different machine-learning approaches to classify tree species in two forest regions in Bavaria, Germany. Atmospheric correction was applied to the level 1C data, resulting in true surface reflectance or bottom of atmosphere (BOA) output. We developed a semiautomatic workflow for the classification of coniferous (mainly spruce trees), beech and oak trees by evaluating different classification algorithms (object- and pixel-based) in an architecture optimized for distributed processing. A hierarchical approach was used to evaluate different band combinations and algorithms (Support Vector Machines (SVM) and Random Forest (RF)) for the separation of broad-leaved vs. coniferous trees. The Ebersberger forest was the main project region and the Freisinger forest was used in a transferability study. Accuracy assessment and training of the algorithms were based on inventory data; validation was conducted using an independent dataset. A confusion matrix, with User's and Producer's Accuracies as well as Overall Accuracies, was created for all analyses. In total, we tested 16 different classification setups for coniferous vs. broad-leaved trees, achieving the best performance of 97% for an object-based multitemporal SVM approach using only band 8 from three scenes (May, August and September). For the separation of beech and oak trees we evaluated 54 different setups; the best result achieved an accuracy of 91% for an object-based, multitemporal SVM approach using bands 8, 2 and 3 of the May scene for segmentation and all principal components of the August scene for classification. The transferability of the model was tested for the Freisinger forest and showed similar results. This project shows that Sentinel-2 had only marginally worse results than comparable commercial high-resolution satellite sensors and is well suited for forest analysis at the tree-stand level.
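User's, producer's, and overall accuracy are all read from the same confusion matrix. The sketch below assumes rows hold the reference (inventory) class and columns hold the classified class, which is one common convention but is not stated in the abstract. The counts are hypothetical.

```python
def accuracy_report(matrix, class_names):
    """Per-class producer's and user's accuracy plus overall accuracy.

    Assumed layout: matrix[i][j] = samples of reference class i classified as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    report = {"overall": sum(matrix[i][i] for i in range(n)) / total}
    for i, name in enumerate(class_names):
        row_total = sum(matrix[i])                       # all reference samples of class i
        col_total = sum(matrix[r][i] for r in range(n))  # all samples classified as class i
        report[name] = {
            "producers": matrix[i][i] / row_total,  # share of reference samples found
            "users": matrix[i][i] / col_total,      # share of classified samples that are right
        }
    return report

# Hypothetical counts, for illustration only.
classes = ["coniferous", "broad-leaved"]
matrix = [
    [45, 5],
    [10, 40],
]
report = accuracy_report(matrix, classes)
print(report)
```

Producer's accuracy corresponds to recall and user's accuracy to precision in machine-learning terminology.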


Development of CYP3A4 Inhibition Models: Comparisons of Machine-Learning Techniques and Molecular Descriptors

CrossRef Listing of Deleted DOIs ◽

10.1177/1087057104274091 ◽

2005 ◽

Vol 10 (3) ◽

pp. 197-205 ◽

Cited By ~ 46

Author(s):

Rieko Arimoto ◽

Madhu-Ashni Prasad ◽

Eric M. Gifford

Keyword(s):

Machine Learning ◽

High Throughput Screening ◽

Computational Models ◽

Similarity Index ◽

Topological Indices ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Chemical Information ◽

Svm Model

Computational models of cytochrome P450 3A4 inhibition were developed based on high-throughput screening data for 4470 proprietary compounds. Multiple models differentiating inhibitors (IC50 < 3 μM) and noninhibitors were generated using various machine-learning algorithms (recursive partitioning [RP], Bayesian classifier, logistic regression, k-nearest-neighbor, and support vector machine [SVM]) with structural fingerprints and topological indices. Nineteen models were evaluated by internal 10-fold cross-validation and also by an independent test set. The three most predictive models, Barnard Chemical Information (BCI)-fingerprint/SVM, MDL-keyset/SVM, and topological indices/RP, correctly classified 249, 248, and 236 compounds of 291 noninhibitors and 135, 137, and 147 compounds of 179 inhibitors in the validation set. Their overall accuracies were 82%, 82%, and 81%, respectively. An investigation of the applicability of the BCI/SVM model found a strong correlation between the predictive performance and the structural similarity to the training set. Using the Tanimoto similarity index as a confidence measurement for the predictions, the limitation of the extrapolation was 0.7 in the case of the BCI/SVM model. Taking the consensus of the 3 best models yielded a further improvement in predictive capability, with kappa = 0.65 and accuracy = 83%. The consensus model could also be tuned to minimize either false positives or false negatives, depending on the emphasis of the screening.
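The overall accuracies can be reproduced from the counts in the abstract. The sketch below does that and also shows Cohen's kappa and the Tanimoto index in their standard forms. The per-model kappa values it prints are derived here and are not reported in the abstract. The bit sets passed to `tanimoto` are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
def accuracy_and_kappa(tn, fp, fn, tp):
    """Overall accuracy and Cohen's kappa for a 2x2 confusion matrix."""
    n = tn + fp + fn + tp
    observed = (tn + tp) / n
    # Agreement expected by chance from the row and column totals.
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
    return observed, (observed - expected) / (1 - expected)

def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Correctly classified (noninhibitors, inhibitors) as reported in the abstract.
NONINHIBITORS, INHIBITORS = 291, 179
reported = {
    "BCI-fingerprint/SVM": (249, 135),
    "MDL-keyset/SVM": (248, 137),
    "topological indices/RP": (236, 147),
}

results = {}
for name, (tn, tp) in reported.items():
    acc, kappa = accuracy_and_kappa(tn, NONINHIBITORS - tn, INHIBITORS - tp, tp)
    results[name] = (acc, kappa)
    print(name, round(acc, 2), round(kappa, 2))

# Hypothetical fingerprints, for illustration only.
print(tanimoto({1, 2, 3, 4}, {3, 4, 5}))
```

Rounded to two decimals, the three accuracies match the 82%, 82%, and 81% stated in the abstract.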


Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets

Briefings in Bioinformatics ◽

10.1093/bib/bbaa321 ◽

2020 ◽

Author(s):

Zhenxing Wu ◽

Minfeng Zhu ◽

Yu Kang ◽

Elaine Lai-Han Leung ◽

Tailong Lei ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Support Vector Machine ◽

Gaussian Process Regression ◽

Principal Component ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Linear Svm

Although a wide variety of machine learning (ML) algorithms have been utilized to learn quantitative structure–activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. Therefore, a comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multiple adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were employed to learn the regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. The overall performances of the different algorithms can be ranked from best to worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models by integrating the predictions of multiple ML algorithms. The results illustrate that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
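One simple way to integrate the predictions of several regressors is to average them. The abstract does not say how the authors combined their models, so the unweighted mean below is an assumption, and the numbers are hypothetical toy values chosen so that the two models err in different directions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def average_ensemble(*prediction_lists):
    """Unweighted mean of the predictions of several models, sample by sample."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]

# Hypothetical targets and model outputs, for illustration only.
y_true = [1.0, 2.0, 3.0]
model_a = [1.4, 2.6, 3.3]   # tends to over-predict
model_b = [0.8, 1.5, 2.9]   # tends to under-predict

ensemble = average_ensemble(model_a, model_b)
print(rmse(y_true, model_a), rmse(y_true, model_b), rmse(y_true, ensemble))
```

Averaging helps here because the errors partly cancel. Models from the same category tend to make correlated errors, which may be why combining algorithms from different categories is the case highlighted in the abstract.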


Research Methods in Machine Learning: A Content Analysis

International Journal of Computer and Information Technology(2279-0764) ◽

10.24203/ijcit.v10i2.79 ◽

2021 ◽

Vol 10 (2) ◽

Author(s):

Jackson Kamiri ◽

Geoffrey Mariga

Keyword(s):

Machine Learning ◽

Programming Languages ◽

Research Methods ◽

Confusion Matrix ◽

Machine Learning Algorithms ◽

Support Vector ◽

Research Approach ◽

Quantitative Research Methods ◽

Learning Research ◽

Prediction Problems

Research methods in machine learning play a pivotal role, since the accuracy and reliability of the results are influenced by the research methods used. The main aims of this paper were to explore current research methods in machine learning, emerging themes, and the implications of those themes for machine learning research. To achieve this, the researchers analyzed a total of 100 articles published since 2019 in IEEE journals. This study revealed that machine learning uses quantitative research methods, with experimental research design being the de facto research approach. The study also revealed that researchers nowadays use more than one algorithm to address a problem. Optimal feature selection has also emerged as a key technique that researchers use to optimize the performance of machine learning algorithms. The confusion matrix and its derivatives are still the main means of evaluating the performance of algorithms, although researchers are now also considering the processing time an algorithm takes to execute. The Python programming language, together with its libraries, is the most used tool for creating, training, and testing models. The most used algorithms for addressing both classification and prediction problems are Naïve Bayes, Support Vector Machine, Random Forest, Artificial Neural Networks, and Decision Tree. The recurring themes identified in this study are likely to open new frontiers in machine learning research.


Comparing Methods of Feature Extraction of Brain Activities for Octave Illusion Classification Using Machine Learning

Sensors ◽

10.3390/s21196407 ◽

2021 ◽

Vol 21 (19) ◽

pp. 6407

Author(s):

Nina Pilyugina ◽

Akihiko Tsukahara ◽

Keita Tanaka

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Feature Selection ◽

Principal Component ◽

Machine Learning Algorithms ◽

Recursive Feature Elimination ◽

Support Vector ◽

Selection Methods ◽

Automatic Feature Extraction ◽

Octave Illusion

The aim of this study was to find an efficient method to determine features that characterize octave illusion data. Specifically, this study compared the efficiency of several automatic feature selection methods for automatic feature extraction from auditory steady-state response (ASSR) data in brain activities, in order to distinguish auditory octave illusion and nonillusion groups by the difference in ASSR amplitudes using machine learning. We compared univariate selection, recursive feature elimination, principal component analysis, and feature importance by testing the results of the feature selection methods with several machine learning algorithms: linear regression, random forest, and support vector machine. Univariate selection with the SVM as the classification method showed the highest accuracy, 75%, compared to 66.6% without feature selection. The results obtained will be used in future work on explaining the mechanism behind the octave illusion phenomenon and on creating an algorithm for automatic octave illusion classification.
