Classification of Soils into Hydrologic Groups Using Machine Learning

Hydrologic soil groups play an important role in the determination of surface runoff, which, in turn, is crucial for soil and water conservation efforts. Traditionally, placement of soil into appropriate hydrologic groups is based on the judgement of soil scientists, primarily relying on their interpretation of guidelines published by regional or national agencies. As a result, large-scale mapping of hydrologic soil groups results in widespread inconsistencies and inaccuracies. This paper presents an application of machine learning for classification of soil into hydrologic groups. Based on features such as percentages of sand, silt and clay, and the value of saturated hydraulic conductivity, machine learning models were trained to classify soil into four hydrologic groups. The results of the classification obtained using algorithms such as k-Nearest Neighbors, Support Vector Machine with Gaussian Kernel, Decision Trees, Classification Bagged Ensembles and TreeBagger (Random Forest) were compared to those obtained using estimation based on soil texture. The performance of these models was compared and evaluated using per-class metrics and micro- and macro-averages. Overall, performance metrics related to kNN, Decision Tree and TreeBagger exceeded those for SVM-Gaussian Kernel and Classification Bagged Ensemble. Among the four hydrologic groups, it was noticed that group B had the highest rate of false positives.
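
The difference between the micro- and macro-averages mentioned in this abstract can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the confusion matrix holds hypothetical counts for groups A to D, and the function names are assumptions.

```python
def per_class_counts(confusion):
    """Return (tp, fp, fn) for each class of a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    counts = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # column total minus hits
        fn = sum(confusion[k]) - tp                        # row total minus hits
        counts.append((tp, fp, fn))
    return counts


def macro_precision(confusion):
    """Average the per-class precisions, so every class weighs the same."""
    counts = per_class_counts(confusion)
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in counts]
    return sum(precisions) / len(precisions)


def micro_precision(confusion):
    """Pool the counts over all classes first, so every sample weighs the same."""
    counts = per_class_counts(confusion)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical counts for hydrologic groups A, B, C, D (not taken from the paper).
confusion = [
    [8, 2, 0, 0],
    [1, 6, 1, 0],
    [0, 3, 5, 0],
    [0, 1, 0, 3],
]

print(per_class_counts(confusion))
print(macro_precision(confusion), micro_precision(confusion))
```

In this made-up matrix the second class (B) collects the most false positives, which lowers its per-class precision and therefore the macro-average more than the micro-average.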

Support Vector Machines in Big Data Classification: A Systematic Literature Review

10.21203/rs.3.rs-663359/v1 ◽

2021 ◽

Author(s):

Mohammad Hassan Almaspoor ◽

Ali Safaei ◽

Afshin Salajegheh ◽

Behrouz Minaei-Bidgoli

Keyword(s):

Machine Learning ◽

Big Data ◽

Large Scale ◽

Support Vector ◽

Research Areas ◽

Large Scale Data ◽

Training Samples ◽

Big Data Classification ◽

Scale Data

Abstract: Classification is one of the most important and widely used tasks in machine learning; its purpose is to create a rule for assigning data to pre-existing categories based on a set of training samples. Employed successfully in many scientific and engineering areas, the Support Vector Machine (SVM) is among the most promising classification methods in machine learning. With the advent of big data, many machine learning methods have been challenged by big data characteristics. The standard SVM was proposed for batch learning, in which all data are available at the same time. The SVM has a high time complexity, i.e., increasing the number of training samples intensifies the need for computational resources and memory. Hence, many attempts have been made to adapt the SVM to online learning conditions and to large-scale data. This paper focuses on the analysis, identification, and classification of existing methods for adapting the SVM to online conditions and large-scale data. These methods might be employed to classify big data, and the paper proposes research areas for future studies. Considering its advantages, the SVM can be among the first options for adaptation to big data and for big data classification. For this purpose, appropriate techniques should be developed for data preprocessing in order to convert data into an appropriate form for learning. The existing frameworks for parallel and distributed processing should also be employed so that SVMs can be made scalable and properly online, and thus able to handle big data.
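
To make the batch-versus-online distinction concrete, here is a minimal sketch of a linear SVM trained one sample at a time. It uses a Pegasos-style stochastic subgradient update, which is chosen here purely for illustration; the abstract does not name this or any other specific method. The class name, the regularization value, and the toy data are all assumptions.

```python
class OnlineLinearSVM:
    """Linear SVM without a bias term, updated one sample at a time.

    Pegasos-style update on the regularized hinge loss; an illustrative
    stand-in, not a method taken from the reviewed paper.
    """

    def __init__(self, n_features, lam=0.1):
        self.w = [0.0] * n_features
        self.lam = lam   # regularization strength (hypothetical value)
        self.t = 0       # number of samples seen so far

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def partial_fit(self, x, y):
        """Update the weights from a single sample (x, y) with y in {-1, +1}."""
        self.t += 1
        eta = 1.0 / (self.lam * self.t)          # decaying step size
        violates_margin = y * self.decision(x) < 1
        shrink = 1.0 - eta * self.lam            # effect of the L2 regularizer
        self.w = [shrink * wi for wi in self.w]
        if violates_margin:
            self.w = [wi + eta * y * xi for wi, xi in zip(self.w, x)]

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1


# Toy, linearly separable stream (hypothetical data).
stream = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-2.0, -1.0), -1), ((-1.0, -3.0), -1)]

model = OnlineLinearSVM(n_features=2)
for _ in range(50):                  # replay the stream a few times
    for x, y in stream:
        model.partial_fit(x, y)      # only one sample is needed at a time

predictions = [model.predict(x) for x, _ in stream]
print(predictions)
```

Each update touches a single sample, so memory use does not grow with the number of training samples, which is the property the batch formulation lacks.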

A machine learning approach to predict ethnicity using personal name and census location in Canada

PLoS ONE ◽

10.1371/journal.pone.0241239 ◽

2020 ◽

Vol 15 (11) ◽

pp. e0241239

Author(s):

Kai On Wong ◽

Osmar R. Zaïane ◽

Faith G. Davis ◽

Yutaka Yasui

Keyword(s):

Machine Learning ◽

First Nations ◽

Predictive Value ◽

Large Scale ◽

Performance Metrics ◽

Characteristic Curve ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approach ◽

Machine Learning Approach

Background: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. Methods: Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, Area Under the Curve for the Receiver Operating Characteristic curve, and accuracy. Results: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged from 63% to 67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%).
Conclusions: The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians, with performance varying by ethnic category.
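
Most of the predictive performance metrics listed in this abstract follow directly from the four cells of a binary confusion matrix. The sketch below shows the standard definitions in Python; the counts are hypothetical and the function names are assumptions, not part of the study.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from true/false positive/negative counts."""
    sensitivity = ratio(tp, tp + fn)   # recall, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    ppv = ratio(tp, tp + fp)           # positive predictive value, precision
    npv = ratio(tn, tn + fn)           # negative predictive value
    f1 = ratio(2 * ppv * sensitivity, ppv + sensitivity)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for one rare category (not from the study).
metrics = binary_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare positive class, accuracy is dominated by the true negatives, so it can sit well above F1; this is one reason the two figures reported for a single model can differ markedly.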

Feature Selection from Lyme Disease Patient Survey Using Machine Learning

Algorithms ◽

10.3390/a13120334 ◽

2020 ◽

Vol 13 (12) ◽

pp. 334

Author(s):

Joshua Vendrow ◽

Jamie Haddock ◽

Deanna Needell ◽

Lorraine Johnson

Keyword(s):

Machine Learning ◽

Lyme Disease ◽

Large Scale ◽

Disease Patient ◽

Patient Survey ◽

Machine Learning Techniques ◽

Medical Community ◽

Support Vector ◽

Global Rating ◽

K Nearest Neighbors

Lyme disease is a rapidly growing illness that remains poorly understood within the medical community. Critical questions about when and why patients respond to treatment or stay ill, what kinds of treatments are effective, and even how to properly diagnose the disease remain largely unanswered. We investigate these questions by applying machine learning techniques to a large scale Lyme disease patient registry, MyLymeData, developed by the nonprofit LymeDisease.org. We apply various machine learning methods in order to measure the effect of individual features in predicting participants’ answers to the Global Rating of Change (GROC) survey questions that assess the self-reported degree to which their condition improved, worsened, or remained unchanged following antibiotic treatment. We use basic linear regression, support vector machines, neural networks, entropy-based decision tree models, and k-nearest neighbors approaches. We first analyze the general performance of the model and then identify the most important features for predicting participant answers to GROC. After we identify the “key” features, we separate them from the dataset and demonstrate the effectiveness of these features at identifying GROC. In doing so, we highlight possible directions for future study both mathematically and clinically.

Multi-label classification approach for quranic verses labeling

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v24.i1.pp484-490 ◽

2021 ◽

Vol 24 (1) ◽

pp. 484

Author(s):

Abdullahi Adeleke ◽

Noor Azah Samsudin ◽

Mohd Hisyam Abdul Rahim ◽

Shamsul Kamal Ahmad Khalid ◽

Riswan Efendi

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Research Work ◽

Support Vector ◽

Classification Problems ◽

K Nearest Neighbors ◽

Training Systems ◽

Learning Tasks ◽

Binary Relevance ◽

Problem Transformation

Machine learning involves the task of training systems to make decisions without being explicitly programmed. Important among machine learning tasks is classification, the process of training machines to make predictions from predefined labels. Classification is broadly categorized into three distinct groups: single-label (SL), multi-class, and multi-label (ML) classification. This research work presents an application of a multi-label classification (MLC) technique to automating the labeling of Quranic verses. MLC has been gaining attention in recent years, owing to the increasing amount of work on real-world classification problems involving multi-label data. In traditional classification problems, patterns are associated with a single label from a set of disjoint labels. In MLC, however, an instance of data is associated with a set of labels. In this paper, three standard MLC methods, namely the binary relevance (BR), classifier chain (CC), and label powerset (LP) algorithms, are implemented with four baseline classifiers: support vector machine (SVM), naïve Bayes (NB), k-nearest neighbors (k-NN), and J48. The research methodology adopts the multi-label problem transformation (PT) approach. The results are validated using six conventional performance metrics: Hamming loss, accuracy, one-error, micro-F1, macro-F1, and average precision. The classifiers achieved accuracy above the 70% mark. Overall, SVM achieved the best results with the CC and LP algorithms.
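
The problem transformation idea and one of the metrics named in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the label names and label sets are hypothetical, the function names are invented for the example, and no classifier is trained.

```python
def binary_relevance_targets(label_sets, labels):
    """Binary relevance: one independent 0/1 target vector per label."""
    return {
        label: [1 if label in s else 0 for s in label_sets]
        for label in labels
    }


def label_powerset_targets(label_sets):
    """Label powerset: each distinct label combination becomes one class id."""
    class_ids = {}
    # Ids are assigned in order of first appearance.
    return [class_ids.setdefault(frozenset(s), len(class_ids)) for s in label_sets]


def hamming_loss(true_sets, predicted_sets, labels):
    """Fraction of (instance, label) pairs that are predicted wrongly."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, predicted_sets))
    return wrong / (len(true_sets) * len(labels))


# Hypothetical topic labels and label sets (not from the paper).
labels = ["faith", "law", "history"]
true_sets = [{"faith"}, {"faith", "law"}, {"history"}]
predicted_sets = [{"faith"}, {"law"}, {"history", "law"}]

br = binary_relevance_targets(true_sets, labels)
lp = label_powerset_targets(true_sets)
loss = hamming_loss(true_sets, predicted_sets, labels)
print(br, lp, loss)
```

Binary relevance yields one single-label problem per label, while label powerset yields one multi-class problem; either can then be handed to an ordinary baseline classifier.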

Hyperparameter tuning for multi-label classification of feedbacks in online courses

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219238 ◽

2021 ◽

pp. 1-9

Author(s):

Dorian Ruiz Alonso ◽

Claudia Zepeda Cortés ◽

Hilda Castillo Zacatelco ◽

José Luis Carballido Carranza

Keyword(s):

Search Strategy ◽

Performance Metrics ◽

Online Courses ◽

Support Vector ◽

Grid Search ◽

K Nearest Neighbors ◽

Data Set ◽

Feedback Model ◽

Vector Machines

In this work, we propose an extension of a methodology for the multi-label classification of feedback according to the Hattie and Timperley feedback model, incorporating a hyperparameter tuning stage. We analyze whether adding this stage before running the support vector machine, random forest, and multi-label k-nearest neighbors algorithms improves the performance metrics of the multi-label classifiers. These classifiers automatically assign the feedback that a teacher gives on activities submitted by students in online courses on the Blackboard platform to the task, process, regulation, praise, and other levels proposed in the Hattie and Timperley feedback model. The grid search strategy is used to refine the hyperparameters of each algorithm. The results show that adjusting the hyperparameters improves the performance metrics for the data set used.
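
A grid search of the kind described here simply evaluates every combination of candidate hyperparameter values and keeps the best one. The following is a minimal Python sketch, assuming a hypothetical scoring function in place of a real cross-validated metric; the parameter names `C` and `gamma` and their candidate values are illustrative, not taken from the paper.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a validation metric; peaks at C=10, gamma=0.1."""
    return 1.0 - abs(params["C"] - 10) / 100 - abs(params["gamma"] - 0.1)


# Hypothetical candidate values for two SVM-like hyperparameters.
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the candidate list lengths (12 evaluations here), which is why grids are usually kept coarse.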

A Comparison of the Performance of Supervised Learning Algorithms for Solar Power Prediction

Energies ◽

10.3390/en14154424 ◽

2021 ◽

Vol 14 (15) ◽

pp. 4424

Author(s):

Leidy Gutiérrez ◽

Julian Patiño ◽

Eduardo Duque-Grisales

Keyword(s):

Machine Learning ◽

Power Generation ◽

Large Scale ◽

Fossil Fuels ◽

Machine Learning Techniques ◽

Support Vector ◽

Power Prediction ◽

Electric Networks ◽

K Nearest Neighbors ◽

Supervised Learning Algorithms

Science seeks strategies to mitigate global warming and reduce the negative impacts of the long-term use of fossil fuels for power generation. In this sense, implementing and promoting renewable energy in different ways becomes one of the most effective solutions. The inaccuracy in the prediction of power generation from photovoltaic (PV) systems is a significant concern for the planning and operational stages of interconnected electric networks and the promotion of large-scale PV installations. This study proposes the use of Machine Learning techniques to model the photovoltaic power production for a system in Medellín, Colombia. Four forecasting models were generated from techniques compatible with Machine Learning and Artificial Intelligence methods: K-Nearest Neighbors (KNN), Linear Regression (LR), Artificial Neural Networks (ANN) and Support Vector Machines (SVM). The results indicate that all four methods produced adequate estimates of photovoltaic energy generation. However, the best estimate according to RMSE and MAE is the ANN forecasting model. The proposed Machine Learning-based models were demonstrated to be practical and effective solutions to forecast PV power generation in Medellín.
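
The two error measures used to rank the forecasting models, RMSE and MAE, can be computed as in the short Python sketch below. The power values are hypothetical and only illustrate the definitions; they are not data from the study.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more than MAE does."""
    mean_squared = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_squared)


# Hypothetical PV power readings and forecasts in kW (not from the study).
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]

print(mae(actual, predicted), rmse(actual, predicted))
```

Because the errors are squared before averaging, RMSE is never smaller than MAE, and the gap between the two widens when a few forecasts are far off.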

Machine Learning-Based Classification of Electrical Low Voltage Cable Degradation

Energies ◽

10.3390/en14102852 ◽

2021 ◽

Vol 14 (10) ◽

pp. 2852

Author(s):

Egnonnumi Lorraine Codjo ◽

Bashir Bakhshideh Zad ◽

Jean-François Toubeau ◽

Bruno François ◽

François Vallée

Keyword(s):

Machine Learning ◽

Large Scale ◽

Low Voltage ◽

Distribution Networks ◽

Machine Learning Techniques ◽

K Nearest Neighbors ◽

Cable Insulation ◽

Insulation Thickness ◽

Nodal Voltage

Low voltage distribution networks have not traditionally been designed to accommodate the large-scale integration of decentralized photovoltaic (PV) generation. The bidirectional power flows in existing networks that result from changes in load demand and PV generation, together with the influence of ambient temperature, lead to voltage variations and increase the leakage current through the cable insulation. In this paper, a machine learning-based framework is implemented for the identification of cable degradation by using data from deployed smart meter (SM) measurements. Nodal voltage variations are assumed to be related to cable conditions (reduction of cable insulation thickness due to insulation wear) and to client net demand changes. Various machine learning techniques are applied for classification of nodal voltages according to the cable insulation conditions. Once trained on the comprehensive generated datasets, the implemented techniques can classify new network operating points into a healthy or degraded cable condition with high accuracy. The simulation results reveal that the logistic regression and decision tree algorithms give better predictions (97.9% and 99.9% accuracy, respectively) than k-nearest neighbors (which reaches only 76.7%). The proposed framework offers promising perspectives for the early identification of LV cable conditions by using SM measurements.

Use of Machine Learning to Investigate the Quantitative Checklist for Autism in Toddlers (Q-CHAT) towards Early Autism Screening

Diagnostics ◽

10.3390/diagnostics11030574 ◽

2021 ◽

Vol 11 (3) ◽

pp. 574

Author(s):

Gennaro Tartarisco ◽

Giovanni Cicceri ◽

Davide Di Pietro ◽

Elisa Leonardi ◽

Stefania Aiello ◽

...

Keyword(s):

Machine Learning ◽

High Performance ◽

Behavioral Science ◽

Autistic Traits ◽

Classification Performance ◽

Recursive Feature Elimination ◽

Diagnostic Tools ◽

Support Vector ◽

K Nearest Neighbors ◽

Autism Screening

In the past two decades, several screening instruments were developed to detect toddlers who may be autistic both in clinical and unselected samples. Among others, the Quantitative CHecklist for Autism in Toddlers (Q-CHAT) is a quantitative and normally distributed measure of autistic traits that demonstrates good psychometric properties in different settings and cultures. Recently, machine learning (ML) has been applied to behavioral science to improve the classification performance of autism screening and diagnostic tools, but mainly in children, adolescents, and adults. In this study, we used ML to investigate the accuracy and reliability of the Q-CHAT in discriminating young autistic children from those without. Five different ML algorithms (random forest (RF), naïve Bayes (NB), support vector machine (SVM), logistic regression (LR), and K-nearest neighbors (KNN)) were applied to investigate the complete set of Q-CHAT items. Our results showed that ML achieved an overall accuracy of 90%, and the SVM was the most effective, being able to classify autism with 95% accuracy. Furthermore, using the SVM–recursive feature elimination (RFE) approach, we selected a subset of 14 items ensuring 91% accuracy, while 83% accuracy was obtained from the 3 best discriminating items in common to ours and the previously reported Q-CHAT-10. This evidence confirms the high performance and cross-cultural validity of the Q-CHAT, and supports the application of ML to create shorter and faster versions of the instrument, maintaining high classification accuracy, to be used as a quick, easy, and high-performance tool in primary-care settings.

Machine Learning Methods Applied to the Prediction of Pseudo-nitzschia spp. Blooms in the Galician Rias Baixas (NW Spain)

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10040199 ◽

2021 ◽

Vol 10 (4) ◽

pp. 199

Author(s):

Francisco M. Bellas Aláez ◽

Jesus M. Torres Palenzuela ◽

Evangelos Spyrakos ◽

Luis González Vilas

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Prediction Models ◽

Support Vector ◽

False Alarms ◽

Learning Approaches ◽

Learning Methods ◽

Machine Learning Methods ◽

Rías Baixas ◽

New Algorithms

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.

NLOS Multipath Classification of GNSS Signal Correlation Output Using Machine Learning

Sensors ◽

10.3390/s21072503 ◽

2021 ◽

Vol 21 (7) ◽

pp. 2503

Author(s):

Taro Suzuki ◽

Yoshiharu Amano

Keyword(s):

Machine Learning ◽

Satellite System ◽

Training Data ◽

Support Vector ◽

Positioning Errors ◽

Automated Method ◽

Global Navigation Satellite ◽

Better Than ◽

Signal Correlation

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted by NLOS multipath, so features of this shape are used to discriminate NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. In addition, we propose an automated method of collecting LOS and NLOS training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that the NN was better than the SVM, and 97.7% of NLOS signals were correctly discriminated.