Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Jianli Shao; Xin Liu; Wenqing He

doi:10.3390/math9090936

Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Mathematics ◽

10.3390/math9090936 ◽

2021 ◽

Vol 9 (9) ◽

pp. 936

Author(s):

Jianli Shao ◽

Xin Liu ◽

Wenqing He

Keyword(s):

Machine Learning ◽

Spatial Association ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Kernel Functions ◽

Support Vector ◽

Classification Problems ◽

Rare Class ◽

Data Adaptive

Imbalanced data exist in many classification problems. The classification of imbalanced data has remarkable challenges in machine learning. The support vector machine (SVM) and its variants are popularly used in machine learning among different classifiers thanks to their flexibility and interpretability. However, the performance of SVMs is impacted when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM by considering class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of the commonly used accuracy measures such as the F-score and G-means, but also successfully detects more than 60% of instances from the rare class in the real data, while the competitors can only detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.

Download Full-text

Improving activated sludge classification based on imbalanced data

Journal of Hydroinformatics ◽

10.2166/hydro.2014.123 ◽

2014 ◽

Vol 16 (6) ◽

pp. 1331-1342 ◽

Cited By ~ 1

Author(s):

Y. Qian ◽

Y. C. Liang ◽

R. C. Guan

Keyword(s):

Neural Network ◽

Wastewater Treatment ◽

Activated Sludge ◽

Recognition Rate ◽

Back Propagation ◽

Area Under The Curve ◽

Imbalanced Data ◽

Real Data ◽

Support Vector ◽

Classification Problems

A fast and accurate classification method for sewage sludge biological activity classification is of great significance for wastewater treatment. However, the data are often imbalanced and the accuracy of traditional classification algorithms applied to imbalanced small classes of data is very low. Such small classes are crucial application data. Therefore, based on the analysis of eight microorganisms, a novel method is proposed in this paper for the classification of activated sludge known as balanced support-vector-based back-propagation (SV-BP) neural network. It first splits the multiclass classification problem into a plurality of pairwise classification problems and uses a support vector machine (SVM) to achieve equalization. Second, the new dataset is produced, following which back-propagation neural network (BPNN) is used for training and classification. To examine the efficiency of the model, 1731 real data points are collected from a wastewater treatment factory and divide the data into four classes with the help of wastewater experts. Based on the new model, data redundancy and noise are greatly reduced. With area under the curve (AUC) measurements, we find that the AUC of SV-BP is 6.9% higher than classical BPNN. In addition, the small-class recognition rate of SV-BP is far better than that by classical BPNN and SVM algorithms.

Download Full-text

A Comparative Study of Different Machine Learning Algorithms for Disease Prediction

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i7/0177 ◽

2017 ◽

Vol 7 (7) ◽

pp. 172

Author(s):

Anantvir Singh Romana

Keyword(s):

Machine Learning ◽

Subsequent Treatment ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Disease Prediction ◽

Classification Problems ◽

Learning Techniques ◽

Neural Network Classifiers ◽

Diagnostic Detection

Accurate diagnostic detection of the disease in a patient is critical and may alter the subsequent treatment and increase the chances of survival rate. Machine learning techniques have been instrumental in disease detection and are currently being used in various classification problems due to their accurate prediction performance. Various techniques may provide different desired accuracies and it is therefore imperative to use the most suitable method which provides the best desired results. This research seeks to provide comparative analysis of Support Vector Machine, Naïve bayes, J48 Decision Tree and neural network classifiers breast cancer and diabetes datsets.

Download Full-text

Risky Driver Recognition with Class Imbalance Data and Automated Machine Learning Framework

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18147534 ◽

2021 ◽

Vol 18 (14) ◽

pp. 7534

Author(s):

Ke Wang ◽

Qingwen Xue ◽

Jian John Lu

Keyword(s):

Machine Learning ◽

High Risk ◽

Loss Function ◽

Class Imbalance ◽

Support Vector ◽

Trajectory Data ◽

Recognition Model ◽

Learning Framework ◽

Sampling Cost ◽

Automated Machine Learning

Identifying high-risk drivers before an accident happens is necessary for traffic accident control and prevention. Due to the class-imbalance nature of driving data, high-risk samples as the minority class are usually ill-treated by standard classification algorithms. Instead of applying preset sampling or cost-sensitive learning, this paper proposes a novel automated machine learning framework that simultaneously and automatically searches for the optimal sampling, cost-sensitive loss function, and probability calibration to handle class-imbalance problem in recognition of risky drivers. The hyperparameters that control sampling ratio and class weight, along with other hyperparameters, are optimized by Bayesian optimization. To demonstrate the performance of the proposed automated learning framework, we establish a risky driver recognition model as a case study, using video-extracted vehicle trajectory data of 2427 private cars on a German highway. Based on rear-end collision risk evaluation, only 4.29% of all drivers are labeled as risky drivers. The inputs of the recognition model are the discrete Fourier transform coefficients of target vehicle’s longitudinal speed, lateral speed, and the gap between the target vehicle and its preceding vehicle. Among 12 sampling methods, 2 cost-sensitive loss functions, and 2 probability calibration methods, the result of automated machine learning is consistent with manual searching but much more computation-efficient. We find that the combination of Support Vector Machine-based Synthetic Minority Oversampling TEchnique (SVMSMOTE) sampling, cost-sensitive cross-entropy loss function, and isotonic regression can significantly improve the recognition ability and reduce the error of predicted probability.

Download Full-text

Scalable kernel-based SVM classification algorithm on imbalance air quality data for proficient healthcare

Complex & Intelligent Systems ◽

10.1007/s40747-021-00435-5 ◽

2021 ◽

Author(s):

Shwet Ketu ◽

Pramod Kumar Mishra

Keyword(s):

Air Pollution ◽

Air Quality ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Algorithm ◽

Quality Data ◽

Pollution Level ◽

Classification Problems ◽

Chi Square ◽

The Impact

AbstractIn the last decade, we have seen drastic changes in the air pollution level, which has become a critical environmental issue. It should be handled carefully towards making the solutions for proficient healthcare. Reducing the impact of air pollution on human health is possible only if the data is correctly classified. In numerous classification problems, we are facing the class imbalance issue. Learning from imbalanced data is always a challenging task for researchers, and from time to time, possible solutions have been developed by researchers. In this paper, we are focused on dealing with the imbalanced class distribution in a way that the classification algorithm will not compromise its performance. The proposed algorithm is based on the concept of the adjusting kernel scaling (AKS) method to deal with the multi-class imbalanced dataset. The kernel function's selection has been evaluated with the help of weighting criteria and the chi-square test. All the experimental evaluation has been performed on sensor-based Indian Central Pollution Control Board (CPCB) dataset. The proposed algorithm with the highest accuracy of 99.66% wins the race among all the classification algorithms i.e. Adaboost (59.72%), Multi-Layer Perceptron (95.71%), GaussianNB (80.87%), and SVM (96.92). The results of the proposed algorithm are also better than the existing literature methods. It is also clear from these results that our proposed algorithm is efficient for dealing with class imbalance problems along with enhanced performance. Thus, accurate classification of air quality through our proposed algorithm will be useful for improving the existing preventive policies and will also help in enhancing the capabilities of effective emergency response in the worst pollution situation.

Download Full-text

Forecasting Repair and Maintenance Services of Medical Devices Using Support Vector Machine

Volume 1: Additive Manufacturing; Advanced Materials Manufacturing; Biomanufacturing; Life Cycle Engineering; Manufacturing Equipment and Automation ◽

10.1115/msec2021-63966 ◽

2021 ◽

Author(s):

Hao-yu Liao ◽

Willie Cade ◽

Sara Behdad

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Medical Devices ◽

Kernel Functions ◽

Support Vector ◽

Quality Level ◽

Learning Tools ◽

Physical Damage ◽

Repair Service ◽

Product Failure

Abstract Accurate prediction of product failures and the need for repair services become critical for various reasons, including understanding the warranty performance of manufacturers, defining cost-efficient repair strategies, and compliance with safety standards. The purpose of this study is to use machine learning tools to analyze several parameters crucial for achieving a robust repair service system, including the number of repairs, the time of the next repair ticket or product failure, and the time to repair. A large dataset of over 530,000 repairs and maintenance of medical devices has been investigated by employing the Support Vector Machine (SVM) tool. SVM with four kernel functions is used to forecast the timing of the next failure or repair request in the system for two different products and two different failure types, namely random failure and physical damage. A frequency analysis is also conducted to explore the product quality level based on product failure and the time to repair it. Besides, the best probability distributions are fitted for the number of failures, the time between failures, and the time to repair. The results reveal the value of data analytics and machine learning tools in analyzing post-market product performance and the cost of repair and maintenance operations.

Download Full-text

Integration of synthetic minority oversampling technique for imbalanced class

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i1.pp102-108 ◽

2019 ◽

Vol 13 (1) ◽

pp. 102

Author(s):

Noviyanti Santoso ◽

Wahyu Wibowo ◽

Hilda Hikmawati

Keyword(s):

Machine Learning ◽

Data Mining ◽

Support Vector Machine ◽

Class Imbalance ◽

Original Data ◽

Support Vector ◽

Classification Methods ◽

Problematic Issue ◽

Imbalanced Class ◽

F Measure

In the data mining, a class imbalance is a problematic issue to look for the solutions. It probably because machine learning is constructed by using algorithms with assuming the number of instances in each balanced class, so when using a class imbalance, it is possible that the prediction results are not appropriate. They are solutions offered to solve class imbalance issues, including oversampling, undersampling, and synthetic minority oversampling technique (SMOTE). Both oversampling and undersampling have its disadvantages, so SMOTE is an alternative to overcome it. By integrating SMOTE in the data mining classification method such as Naive Bayes, Support Vector Machine (SVM), and Random Forest (RF) is expected to improve the performance of accuracy. In this research, it was found that the data of SMOTE gave better accuracy than the original data. In addition to the three classification methods used, RF gives the highest average AUC, F-measure, and G-means score.

Download Full-text

Forecasting Of Covid-19 Cases Using Machine Learning Approach

Current Respiratory Medicine Reviews ◽

10.2174/1573398x17666210129131009 ◽

2021 ◽

Vol 17 ◽

Author(s):

Sachin Kumar ◽

Karan Veer

Keyword(s):

Machine Learning ◽

Regression Model ◽

Model Performance ◽

Real Data ◽

Absolute Error ◽

Viral Disease ◽

Support Vector ◽

Family Welfare ◽

Accuracy Score ◽

Learning Approaches

Aims: The objective of this research is to predict the covid-19 cases in India based on the machine learning approaches. Background: Covid-19, a respiratory disease caused by one of the coronavirus family members, has led to a pandemic situation worldwide in 2020. This virus was detected firstly in Wuhan city of China in December 2019. This viral disease has taken less than three months to spread across the globe. Objective: In this paper, we proposed a regression model based on the Support vector machine (SVM) to forecast the number of deaths, the number of recovered cases, and total confirmed cases for the next 30 days. Method: For prediction, the data is collected from Github and the ministry of India's health and family welfare from March 14, 2020, to December 3, 2020. The model has been designed in Python 3.6 in Anaconda to forecast the forecasting value of corona trends until September 21, 2020. The proposed methodology is based on the prediction of values using SVM based regression model with polynomial, linear, rbf kernel. The dataset has been divided into train and test datasets with 40% and 60% test size and verified with real data. The model performance parameters are evaluated as a mean square error, mean absolute error, and percentage accuracy. Results and Conclusion: The results show that the polynomial model has obtained 95 % above accuracy score, linear scored above 90%, and rbf scored above 85% in predicting cumulative death, conformed cases, and recovered cases.

Download Full-text

A Measure Optimized Cost-Sensitive Learning Framework for Imbalanced Data Classification

Advances in Data Mining and Database Management - Biologically-Inspired Techniques for Knowledge Discovery and Data Mining ◽

10.4018/978-1-4666-6078-6.ch003 ◽

2014 ◽

pp. 48-75 ◽

Cited By ~ 2

Author(s):

Peng Cao ◽

Osmar Zaiane ◽

Dazhe Zhao

Keyword(s):

Real World ◽

Class Imbalance ◽

Imbalanced Data ◽

Support Vector ◽

Feature Subset ◽

Cost Sensitive Learning ◽

Intrinsic Parameters ◽

Real World Problem ◽

Benchmark Datasets ◽

Feed Forward Neural Networks

Class imbalance is one of the challenging problems for machine-learning in many real-world applications. Many methods have been proposed to address and attempt to solve the problem, including sampling and cost-sensitive learning. The latter has attracted significant attention in recent years to solve the problem, but it is difficult to determine the precise misclassification costs in practice. There are also other factors that influence the performance of the classification including the input feature subset and the intrinsic parameters of the classifier. This chapter presents an effective wrapper framework incorporating the evaluation measure (AUC and G-mean) into the objective function of cost sensitive learning directly to improve the performance of classification by simultaneously optimizing the best pair of feature subset, intrinsic parameters, and misclassification cost parameter. The optimization is based on Particle Swarm Optimization (PSO). The authors use two different common methods, support vector machine and feed forward neural networks, to evaluate the proposed framework. Experimental results on various standard benchmark datasets with different ratios of imbalance and a real-world problem show that the proposed method is effective in comparison with commonly used sampling techniques.

Download Full-text

Global and Local Clustering-Based Regression Models to Forecast Power Consumption in Buildings

Advances in Computer and Electrical Engineering - Handbook of Research on Emerging Technologies for Electrical Power Planning, Analysis, and Optimization ◽

10.4018/978-1-4666-9911-3.ch011 ◽

2016 ◽

pp. 207-234

Author(s):

Gonzalo Vergara ◽

Juan J. Carrasco ◽

Jesus Martínez-Gómez ◽

Manuel Domínguez ◽

José A. Gámez ◽

...

Keyword(s):

Machine Learning ◽

Power Consumption ◽

Electric Power ◽

Principal Component ◽

Real Data ◽

Machine Learning Techniques ◽

Electric Power Consumption ◽

Support Vector ◽

Learning Machines ◽

The University

The study of energy efficiency in buildings is an active field of research. Modeling and predicting energy related magnitudes leads to analyze electric power consumption and can achieve economical benefits. In this study, classical time series analysis and machine learning techniques, introducing clustering in some models, are applied to predict active power in buildings. The real data acquired corresponds to time, environmental and electrical data of 30 buildings belonging to the University of León (Spain). Firstly, we segmented buildings in terms of their energy consumption using principal component analysis. Afterwards, we applied state of the art machine learning methods and compare between them. Finally, we predicted daily electric power consumption profiles and compare them with actual data for different buildings. Our analysis shows that multilayer perceptrons have the lowest error followed by support vector regression and clustered extreme learning machines. We also analyze daily load profiles on weekdays and weekends for different buildings.

Download Full-text

Machine Learning Readmission Risk Modeling: A Pediatric Case Study

BioMed Research International ◽

10.1155/2019/8532892 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Patricio Wolff ◽

Manuel Graña ◽

Sebastián A. Ríos ◽

Maria Begoña Yarza

Keyword(s):

Machine Learning ◽

Multilayer Perceptron ◽

Naive Bayes ◽

Class Imbalance ◽

Predictive Performance ◽

Naïve Bayes ◽

Distribution Model ◽

Training Dataset ◽

Support Vector ◽

Pediatric Hospital

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on the readmission frequency analysis stratified by disease and demographic/geographic characteristics but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions.Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile.Materials. An all-cause admissions dataset has been collected along six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same used for the determination of the child’s treatment administrative cost.Methods. Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We compute repeated cross-validation (RCV) with decreasing number of folders to assess performance and sensitivity to effect of imbalance in the test set and training set size.Results. Increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, resp.). The NB and support vector machines (SVM) give comparable results if we consider AUC, PPV, and f-score ranking for all RCV experiments. High recall of deep multilayer perceptron is due to high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms.Conclusions. We recommend the use of Naive Bayes (NB) with Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.

Download Full-text