Semi-Automatization of Support Vector Machines to Map Lithium (Li) Bearing Pegmatites

Machine learning (ML) algorithms have shown great performance in geological remote sensing applications. This work applied the support vector machine (SVM) to the Fregeneda–Almendra region (Spain–Portugal). Lithium (Li)-pegmatite exploration using satellite data is challenging because pegmatites are, by nature, small, narrow bodies. Consequently, the following objectives were defined: (i) train several SVMs on Sentinel-2 images with different parameters to find the optimal model; (ii) assess the impact of imbalanced data; (iii) develop a successful methodological approach to delineate target areas for Li-exploration. Parameter optimization and model evaluation were accomplished by a two-stage grid search with cross-validation. Several new methodological advances were proposed, including a region of interest (ROI)-based splitting strategy to create the training and test subsets, a semi-automatization of the classification process, and the application of a more innovative and adequate metric score to choose the best model. The proposed methodology obtained good results, identifying known Li-pegmatite occurrences as well as other target areas for Li-exploration. The results also showed that class imbalance had a negative impact on SVM performance, since known Li-pegmatite occurrences were not identified. The potential and limitations of the proposed methodology are highlighted and its applicability to other case studies is discussed.
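The ROI-based splitting idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each labeled pixel carries the identifier of the ROI it was drawn from and that each ROI belongs to a single class; the function name, the ROI identifiers, the class names, and the 30% test fraction are hypothetical and are not taken from the paper. The point is only that whole ROIs, not individual pixels, are assigned to the training or test subset, so neighboring pixels of one ROI never end up on both sides.

```python
import random
from collections import defaultdict


def roi_based_split(roi_ids, labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that every ROI lands wholly on one side.

    Assumption: ROI identifiers are unique and each ROI has a single class.
    The split is done per class so that every class with at least two ROIs
    is represented in both subsets.
    """
    rng = random.Random(seed)
    rois_by_class = defaultdict(set)
    for roi, label in zip(roi_ids, labels):
        rois_by_class[label].add(roi)

    test_rois = set()
    for rois in rois_by_class.values():
        ordered = sorted(rois)
        rng.shuffle(ordered)
        if len(ordered) < 2:
            continue  # a class with a single ROI stays in the training subset
        n_test = max(1, round(len(ordered) * test_fraction))
        n_test = min(n_test, len(ordered) - 1)  # keep at least one ROI for training
        test_rois.update(ordered[:n_test])

    train_idx = [i for i, roi in enumerate(roi_ids) if roi not in test_rois]
    test_idx = [i for i, roi in enumerate(roi_ids) if roi in test_rois]
    return train_idx, test_idx


# Hypothetical toy data: one entry per pixel.
roi_ids = ["peg1", "peg1", "peg2", "peg2", "peg3",
           "veg1", "veg1", "veg2", "veg2", "veg3"]
labels = ["Li-pegmatite"] * 5 + ["vegetation"] * 5

train_idx, test_idx = roi_based_split(roi_ids, labels)
print("train ROIs:", sorted({roi_ids[i] for i in train_idx}))
print("test ROIs:", sorted({roi_ids[i] for i in test_idx}))
```

A pixel-level random split would likely give optimistic test scores here, because adjacent pixels of the same ROI tend to be strongly correlated.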


Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666190925125524 ◽

2019 ◽

Vol 22 (7) ◽

pp. 455-469

Author(s):

Yi-Heng Zhu ◽

Jun Hu ◽

Yong Qi ◽

Xiao-Ning Song ◽

Dong-Jun Yu

Keyword(s):

Machine Learning ◽

Support Vector Machines ◽

Ligand Binding ◽

Binding Sites ◽

Negative Impact ◽

Class Imbalance ◽

Nucleotide Binding ◽

Support Vector ◽

Vector Machines ◽

Ligand Binding Sites

Aim and Objective: The accurate identification of protein-ligand binding sites helps elucidate protein function and facilitates the design of new drugs. Machine-learning-based methods have been widely used for the prediction of protein-ligand binding sites. Nevertheless, the severe class imbalance phenomenon, where the number of nonbinding (majority) residues is far greater than that of binding (minority) residues, has a negative impact on the performance of such machine-learning-based predictors. Materials and Methods: In this study, we aim to relieve the negative impact of class imbalance by Boosting Multiple Granular Support Vector Machines (BGSVM). In BGSVM, each base SVM is trained on a granular training subset consisting of all minority samples and some reasonably selected majority samples. The efficacy of BGSVM for dealing with class imbalance was validated by benchmarking it against several typical imbalance learning algorithms. We further implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, with the BGSVM algorithm. Results: Rigorous cross-validation and independent validation tests for five types of protein-nucleotide interactions demonstrated that the proposed BGSVM-NUC achieves promising prediction performance and outperforms several popular sequence-based protein-nucleotide binding site predictors. The BGSVM-NUC web server is freely available at http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.
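The construction of granular training subsets described above can be sketched as follows. This is an illustration under stated assumptions, not the BGSVM algorithm: the abstract says majority samples are "reasonably selected" without giving the rule, so plain random undersampling is swapped in here, and the boosting step and the SVM training are left out. The function name and its parameters are hypothetical.

```python
import random


def granular_subsets(labels, n_subsets=5, ratio=1.0, minority=1, seed=0):
    """Build index subsets: all minority samples plus sampled majority samples.

    Assumption: majority samples are drawn uniformly at random, which is a
    stand-in for the selection rule used in the paper.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Number of majority samples per subset, capped by what is available.
    k = min(len(majority_idx), max(1, int(ratio * len(minority_idx))))

    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority_idx, k)
        subsets.append(sorted(minority_idx + sampled))
    return subsets


# Hypothetical toy labels: 3 binding (1) and 27 nonbinding (0) residues.
labels = [1] * 3 + [0] * 27
subsets = granular_subsets(labels)
for s in subsets:
    print(s)
```

Each subset is balanced, so a base classifier trained on it sees the minority class as often as the majority class, while the ensemble as a whole still covers much of the majority data.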


Exploiting Correlation Subspace to Predict Heterogeneous Cross-Project Defects

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194016710017 ◽

2016 ◽

Vol 26 (09n10) ◽

pp. 1571-1580 ◽

Cited By ~ 6

Author(s):

Ming Cheng ◽

Guoqing Wu ◽

Hongyan Wan ◽

Guoan You ◽

Mengting Yuan ◽

...

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Feature Space ◽

Support Vector ◽

Class Imbalance Problem ◽

Classifier Design ◽

Imbalance Problem ◽

Project Data ◽

The Impact ◽

Cross Project

Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed that the cross-project data share the same metric set, meaning that both the metrics used and the size of the metric set are identical. However, this assumption may not hold in practical scenarios. In addition, software defect datasets suffer from the class-imbalance problem, which makes it harder for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm which incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration so that the classifier is inclined to label a module as defective, alleviating the impact of imbalanced data. The experimental results show that our method is more effective than state-of-the-art methods.
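The effect of unequal misclassification costs can be illustrated with a class-weighted hinge loss for a linear classifier. This is a generic sketch, not the formulation of the paper (which also involves correlation transfer information); the function name and the cost values 5.0 and 1.0 are hypothetical.

```python
def weighted_hinge_loss(w, b, X, y, cost_pos=5.0, cost_neg=1.0):
    """Hinge loss in which errors on the defective class (+1) cost more.

    Assumption: labels are +1 (defective) and -1 (clean).
    """
    total = 0.0
    for x, label in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        margin = label * score
        cost = cost_pos if label == 1 else cost_neg
        total += cost * max(0.0, 1.0 - margin)
    return total


# Two modules, both on the wrong side of the boundary by the same amount.
w, b = [1.0], 0.0
loss_missed_defect = weighted_hinge_loss(w, b, [[-0.5]], [1])
loss_false_alarm = weighted_hinge_loss(w, b, [[0.5]], [-1])
print(loss_missed_defect, loss_false_alarm)  # 7.5 1.5
```

Because a missed defective module is penalized five times as heavily as a false alarm, minimizing this loss pushes the decision boundary toward classifying borderline modules as defective.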


Selecting the Optimal Combination Model of FSSVM for the Imbalance Datasets

Mathematical Problems in Engineering ◽

10.1155/2014/539430 ◽

2014 ◽

Vol 2014 ◽

pp. 1-6

Author(s):

Chuandong Qin ◽

Huixia Zhao

Keyword(s):

Support Vector Machines ◽

Class Imbalance ◽

Imbalanced Data ◽

Support Vector ◽

Smooth Functions ◽

Separating Hyperplane ◽

Vector Machines ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Imbalanced Data Learning

Imbalanced data learning is one of the most active and important fields in machine learning research. Although existing class imbalance learning methods can make Support Vector Machines (SVMs) less sensitive to class imbalance, they still suffer from the disturbance of outliers and noise present in the datasets. A family of Fuzzy Smooth Support Vector Machines (FSSVMs) is proposed based on the Smooth Support Vector Machine (SSVM) of O. L. Mangasarian. SSVM can be solved easily by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm or the Newton-Armijo algorithm. Two kinds of fuzzy memberships and three smooth functions can be chosen in the algorithms. The fuzzy memberships account for the contribution of each sample to the optimal separating hyperplane. The polynomial smooth functions make the optimization problem more accurate at the inflection point. These changes have a positive effect in the experiments. The results show that the FSSVMs achieve better accuracy in less time than the SSVMs and some of the other methods.
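For context, SSVM replaces the non-differentiable plus function max(x, 0) with a smooth approximation so that gradient-based solvers such as BFGS or Newton-Armijo can be applied. The sketch below shows the classic choice p(x, α) = x + (1/α) log(1 + exp(−αx)); the three smooth functions of the paper, including the polynomial ones, are not specified in the abstract and are not reproduced here. The function name and the value of α are illustrative.

```python
import math


def smooth_plus(x, alpha=5.0):
    """Smooth approximation of max(x, 0): x + log(1 + exp(-alpha * x)) / alpha.

    Written in a numerically stable form that never exponentiates a
    positive number.
    """
    return max(x, 0.0) + math.log1p(math.exp(-alpha * abs(x))) / alpha


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  plus={max(x, 0.0):.4f}  smooth={smooth_plus(x):.4f}")
```

The approximation always lies slightly above max(x, 0), with the largest gap, log(2)/α, at x = 0; increasing α tightens the fit at the price of a less well-conditioned problem.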


Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Mathematics ◽

10.3390/math9090936 ◽

2021 ◽

Vol 9 (9) ◽

pp. 936

Author(s):

Jianli Shao ◽

Xin Liu ◽

Wenqing He

Keyword(s):

Machine Learning ◽

Spatial Association ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Kernel Functions ◽

Support Vector ◽

Classification Problems ◽

Rare Class ◽

Data Adaptive

Imbalanced data exist in many classification problems, and their classification poses considerable challenges in machine learning. The support vector machine (SVM) and its variants are widely used among classifiers thanks to their flexibility and interpretability. However, the performance of SVMs degrades when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM by considering class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of commonly used accuracy measures such as the F-score and G-means, but it also successfully detects more than 60% of instances from the rare class in the real data, while the competitors detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.
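Why measures such as the F-score and the G-mean are preferred over plain accuracy on imbalanced data can be seen in a small worked example. The sketch below uses common definitions (macro-averaged F1 and the geometric mean of per-class recalls); whether the paper uses exactly these variants is an assumption, and the toy labels are hypothetical.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall of every class that occurs in y_true."""
    recalls = {}
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / support
    return recalls


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1.0 / len(recalls))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that ignores the rare class entirely.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, g_mean(y_true, y_pred), macro_f1(y_true, y_pred))
```

Accuracy stays at 0.8 even though no rare-class instance is detected, whereas the G-mean drops to 0 and the macro F1 to about 0.44.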


Intuitionistic Fuzzy Laplacian Twin Support Vector Machine for Semi-supervised Classification

Journal of the Operations Research Society of China ◽

10.1007/s40305-021-00354-9 ◽

2021 ◽

Author(s):

Jia-Bin Zhou ◽

Yan-Qin Bai ◽

Yan-Ru Guo ◽

Hai-Xiang Lin

Keyword(s):

Support Vector Machine ◽

Negative Impact ◽

Twin Support Vector Machine ◽

Fuzzy Membership ◽

Support Vector ◽

Membership Functions ◽

Fuzzy Membership Functions ◽

Intuitionistic Fuzzy ◽

Benchmark Datasets ◽

The Impact

In general, data contain noise that comes from faulty instruments, flawed measurements or faulty communication. Learning from data in the context of classification or regression is inevitably affected by this noise. In order to remove or greatly reduce its impact, we introduce the ideas of fuzzy membership functions and the Laplacian twin support vector machine (Lap-TSVM). A formulation of the linear intuitionistic fuzzy Laplacian twin support vector machine (IFLap-TSVM) is presented. Moreover, we extend the linear IFLap-TSVM to the nonlinear case by means of kernel functions. The proposed IFLap-TSVM mitigates the negative impact of noise and outliers by using fuzzy membership functions, and it is a more accurate and reasonable classifier because it uses the geometric distribution information of labeled and unlabeled data based on manifold regularization. Experiments with constructed artificial datasets, several UCI benchmark datasets and the MNIST dataset show that the IFLap-TSVM has better classification accuracy than other state-of-the-art methods: the twin support vector machine (TSVM), the intuitionistic fuzzy twin support vector machine (IFTSVM) and Lap-TSVM.


Scalable kernel-based SVM classification algorithm on imbalance air quality data for proficient healthcare

Complex & Intelligent Systems ◽

10.1007/s40747-021-00435-5 ◽

2021 ◽

Author(s):

Shwet Ketu ◽

Pramod Kumar Mishra

Keyword(s):

Air Pollution ◽

Air Quality ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Algorithm ◽

Quality Data ◽

Pollution Level ◽

Classification Problems ◽

Chi Square ◽

The Impact

In the last decade, we have seen drastic changes in air pollution levels, which have become a critical environmental issue that must be handled carefully in order to support proficient healthcare. Reducing the impact of air pollution on human health is possible only if the data are correctly classified. Numerous classification problems suffer from class imbalance. Learning from imbalanced data remains a challenging task, and researchers have proposed various solutions over time. In this paper, we focus on dealing with the imbalanced class distribution in such a way that the classification algorithm does not compromise its performance. The proposed algorithm is based on the adjusting kernel scaling (AKS) method for multi-class imbalanced datasets. The selection of the kernel function has been evaluated with the help of weighting criteria and the chi-square test. All experimental evaluation has been performed on the sensor-based Indian Central Pollution Control Board (CPCB) dataset. The proposed algorithm, with the highest accuracy of 99.66%, outperforms the other classification algorithms, i.e., AdaBoost (59.72%), Multi-Layer Perceptron (95.71%), GaussianNB (80.87%), and SVM (96.92%). The results of the proposed algorithm are also better than those of existing methods in the literature. These results also indicate that the proposed algorithm deals with class imbalance problems efficiently while delivering enhanced performance. Thus, accurate classification of air quality through the proposed algorithm will be useful for improving existing preventive policies and will also help enhance the capabilities of effective emergency response in the worst pollution situations.


The impact of different parameter sets on the classification of asteroid types

10.5194/epsc2021-807 ◽

2021 ◽

Author(s):

Hanna Klimczak ◽

Wojciech Kotłowski ◽

Dagmara Oszkiewicz ◽

Francesca DeMeo ◽

Agnieszka Kryszczyńska ◽

...

Keyword(s):

Gradient Boosting ◽

Support Vector ◽

Multilayer Perceptrons ◽

Machine Learning Methods ◽

Vector Machines ◽

Science Centre ◽

The Difference ◽

The Impact

The aim of the project is the classification of asteroids according to the most commonly used asteroid taxonomy (Bus-DeMeo et al. 2009) with the use of various machine learning methods such as Logistic Regression, Naive Bayes, Support Vector Machines, Gradient Boosting and Multilayer Perceptrons. Different parameter sets are used for classification in order to compare the quality of prediction with a limited amount of data, namely the difference in performance between using the 0.45 μm to 2.45 μm spectral range and multiple spectral features, as well as performing Principal Component Analysis to reduce the dimensions of the spectral data. This work has been supported by grant No. 2017/25/B/ST9/00740 from the National Science Centre, Poland.


A Measure Optimized Cost-Sensitive Learning Framework for Imbalanced Data Classification

Advances in Data Mining and Database Management - Biologically-Inspired Techniques for Knowledge Discovery and Data Mining ◽

10.4018/978-1-4666-6078-6.ch003 ◽

2014 ◽

pp. 48-75 ◽

Cited By ~ 2

Author(s):

Peng Cao ◽

Osmar Zaiane ◽

Dazhe Zhao

Keyword(s):

Real World ◽

Class Imbalance ◽

Imbalanced Data ◽

Support Vector ◽

Feature Subset ◽

Cost Sensitive Learning ◽

Intrinsic Parameters ◽

Real World Problem ◽

Benchmark Datasets ◽

Feed Forward Neural Networks

Class imbalance is one of the challenging problems for machine-learning in many real-world applications. Many methods have been proposed to address and attempt to solve the problem, including sampling and cost-sensitive learning. The latter has attracted significant attention in recent years to solve the problem, but it is difficult to determine the precise misclassification costs in practice. There are also other factors that influence the performance of the classification including the input feature subset and the intrinsic parameters of the classifier. This chapter presents an effective wrapper framework incorporating the evaluation measure (AUC and G-mean) into the objective function of cost sensitive learning directly to improve the performance of classification by simultaneously optimizing the best pair of feature subset, intrinsic parameters, and misclassification cost parameter. The optimization is based on Particle Swarm Optimization (PSO). The authors use two different common methods, support vector machine and feed forward neural networks, to evaluate the proposed framework. Experimental results on various standard benchmark datasets with different ratios of imbalance and a real-world problem show that the proposed method is effective in comparison with commonly used sampling techniques.
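One of the evaluation measures named above, the AUC, can be computed directly from classifier scores as the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one. The sketch below shows only this measure, not the wrapper framework or the PSO search of the chapter; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    Assumption: labels are 1 (positive, minority) and 0 (negative), and both
    classes are present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))  # 5 of 6 positive-negative pairs are ordered correctly
```

Because the value depends only on the ranking of positive versus negative samples, it does not change with the decision threshold, which is one reason it is a popular optimization target for imbalanced problems.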
