A new parallel data geometry analysis algorithm to select training data for support vector machine

Yunfeng Shi; Shu Lv; Kaibo Shi

doi:10.3934/math.2021806

AIMS Mathematics ◽

10.3934/math.2021806 ◽

2021 ◽

Vol 6 (12) ◽

pp. 13931-13953

Author(s):

Yunfeng Shi ◽

Shu Lv ◽

Kaibo Shi ◽

...

Keyword(s):

Support Vector Machine ◽

Large Scale ◽

Computational Cost ◽

Training Data ◽

Support Vector ◽

Training Set ◽

Redundant Data ◽

Parallel Data ◽

Low Efficiency ◽

Geometry Analysis

Support vector machine (SVM) is one of the most powerful machine learning techniques and has attracted wide attention because of its remarkable performance. However, on classification problems with large-scale datasets, the high complexity of the SVM model leads to low efficiency and makes it impractical. Exploiting the sparsity of SVM in the sample space, this paper presents a new parallel data geometry analysis (PDGA) algorithm that reduces the SVM training set and thereby improves training efficiency. PDGA introduces the Mahalanobis distance to measure the distance from each sample to its centroid and, based on this, proposes a method that identifies non-support vectors and outliers at the same time to help remove redundant data. When the training set is further reduced, a cosine angle distance analysis method is proposed to determine whether samples are redundant, ensuring that valuable data are not removed. Unlike previous data geometry analysis methods, the PDGA algorithm is implemented in parallel, which greatly reduces the computational cost. Experimental results on an artificial dataset and 6 real datasets show that the algorithm can adapt to different sample distributions, significantly reducing training time and memory requirements without sacrificing classification accuracy, and that it performs clearly better than five other competitive algorithms.
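The two geometric measures named in this abstract can be illustrated with a minimal sketch. This is not the authors' PDGA implementation: the abstract gives no thresholds, selection rules or parallelisation details, so the code below only computes the quantities themselves. The function names are hypothetical, and the Mahalanobis part is restricted to 2-D points so that the covariance inverse can be written in closed form.

```python
import math

def mahalanobis_to_centroid_2d(points):
    """Mahalanobis distance from each 2-D point to the centroid of `points`."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] (divides by n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T with the closed-form 2x2 inverse.
        sq = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(sq, 0.0)))
    return distances

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

How such distances would be turned into a keep-or-discard decision for each sample is specific to the paper and is deliberately left out here.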

Large-scale support vector machine classification with redundant data reduction

Neurocomputing ◽

10.1016/j.neucom.2014.10.102 ◽

2016 ◽

Vol 172 ◽

pp. 189-197 ◽

Cited By ~ 17

Author(s):

Xiang-Jun Shen ◽

Lei Mu ◽

Zhen Li ◽

Hao-Xiang Wu ◽

Jian-Ping Gou ◽

...

Keyword(s):

Support Vector Machine ◽

Data Reduction ◽

Large Scale ◽

Support Vector ◽

Support Vector Machine Classification ◽

Redundant Data

A hybrid of fast K-nearest neighbor and improved directed acyclic graph support vector machine for large-scale supersonic inlet flow pattern recognition

Proceedings of the Institution of Mechanical Engineers Part G Journal of Aerospace Engineering ◽

10.1177/09544100211008601 ◽

2021 ◽

pp. 095441002110086

Author(s):

Huan Wu ◽

Yong-Ping Zhao ◽

Tan Hui-Jun

Keyword(s):

Pattern Recognition ◽

Flow Pattern ◽

Directed Acyclic Graph ◽

Large Scale ◽

Computational Cost ◽

Classification Error ◽

Support Vector ◽

Training Set ◽

Inlet Flow ◽

Supersonic Inlet

Inlet flow pattern recognition is one of the most crucial issues and also the foundation of protection control for supersonic air-breathing propulsion systems. This article proposes a hybrid algorithm of fast K-nearest neighbors (F-KNN) and improved directed acyclic graph support vector machine (I-DAGSVM) to solve this issue based on a large amount of experimental data. The basic idea behind the proposed algorithm is combining F-KNN and I-DAGSVM together to reduce the classification error and computational cost when dealing with big data. The proposed algorithm first finds a small set of nearest samples from the training set quickly by F-KNN and then trains a local I-DAGSVM classifier based on these nearest samples. Compared with standard KNN which needs to compare each test sample with the entire training set, F-KNN uses an efficient index-based strategy to quickly find nearest samples, but there also exists misclassification when the number of nearest samples belonging to different classes is the same. To cope with this, I-DAGSVM is adopted, and its tree structure is improved by a measure of class separability to overcome the sequential randomization in classifier generation and to reduce the classification error. In addition, the proposed algorithm compensates for the expensive computational cost of I-DAGSVM because it only needs to train a local classifier based on a small number of samples found by F-KNN instead of all training samples. With all these strategies, the proposed algorithm combines the advantages of both F-KNN and I-DAGSVM and can be applied to the issue of large-scale supersonic inlet flow pattern recognition. The experimental results demonstrate the effectiveness of the proposed algorithm in terms of classification accuracy and test time.
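The tie problem described in this abstract can be sketched in a few lines. The code is an illustration under stated assumptions, not the paper's method: it uses a brute-force neighbour search rather than the index-based F-KNN, it does not implement I-DAGSVM, and the function names are hypothetical. It only shows how a k-nearest-neighbour vote can detect the tied case that the abstract says is handed to a local classifier.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (point, label) pairs of `train` closest to `query` (brute force)."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def vote_or_defer(neighbors):
    """Return the majority label, or None when the top vote is tied."""
    counts = Counter(label for _, label in neighbors).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Tie between classes: a plain KNN vote is unreliable here, so the
        # caller would fall back to a classifier trained on these neighbors.
        return None
    return counts[0][0]
```

A caller would train the local classifier only on the samples returned by `k_nearest`, which is what keeps the training cost small in the scheme the abstract describes.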

Efficiency of SVM classifier with Word2Vec and Doc2Vec models

Proceedings of the International Conference on Applied Statistics ◽

10.2478/icas-2019-0043 ◽

2019 ◽

Vol 1 (1) ◽

pp. 496-503 ◽

Cited By ~ 1

Author(s):

Maria Mihaela Truşcă

Keyword(s):

Neural Networks ◽

Support Vector Machine ◽

Computational Cost ◽

Data Representation ◽

Training Data ◽

Support Vector ◽

Svm Classifier ◽

Machine Model ◽

Text Data ◽

Numerical Attributes

The Support Vector Machine model has been one of the most intensively used text data classifiers ever since its development. However, its performance depends not only on its features but also on data preprocessing and model tuning. The main purpose of this paper is to compare the efficiency of several Support Vector Machine models using the TF-IDF approach and the Word2Vec and Doc2Vec neural networks for text data representation. Besides the data vectorization process, I try to enhance the models’ efficiency by identifying which kind of kernel fits the data better, or whether it is simply better to opt for the linear case. My results show that for the “Reuters 21578” dataset, the nonlinear Support Vector Machine is more efficient when text data are converted into numerical attributes using Word2Vec models rather than TF-IDF and Doc2Vec representations. When the data are assumed to meet linear separability requirements, the TF-IDF representation outperforms all other options. Surprisingly, Doc2Vec models have the lowest performance, and only in terms of computational cost do they provide satisfactory results. This paper shows that while Word2Vec models are truly efficient for text data representation, Doc2Vec neural networks are unable to exceed even the TF-IDF index representation. This evidence contradicts the common idea that Doc2Vec models should provide better insight into the training data domain than Word2Vec models, and certainly than the TF-IDF index.
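For readers unfamiliar with the TF-IDF baseline mentioned above, here is a minimal sketch of one common variant (term frequency multiplied by log(N / document frequency)). The abstract does not state which weighting or normalisation the paper used, so the formula and the function name `tfidf` are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Map each tokenised document to a {term: weight} dict using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors
```

Terms that occur in every document receive weight zero under this variant, which is why TF-IDF vectors emphasise words that distinguish one document from the rest.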

Research on Automated Defect Classification Based on Visual Sensing and Convolutional Neural Network-Support Vector Machine for GTA-Assisted Droplet Deposition Manufacturing Process

Metals ◽

10.3390/met11040639 ◽

2021 ◽

Vol 11 (4) ◽

pp. 639

Author(s):

Chen Ma ◽

Haifei Dang ◽

Jun Du ◽

Pengfei He ◽

Minbo Jiang ◽

...

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Convolutional Neural Network ◽

Manufacturing Process ◽

Support Vector ◽

Defect Classification ◽

Droplet Deposition ◽

Network Support ◽

Visual Sensing ◽

Low Efficiency

This paper proposes a novel metal additive manufacturing process that combines gas tungsten arc (GTA) and droplet deposition manufacturing (DDM). Because of the complex physical and metallurgical processes involved, such as droplet impact, spreading and surface pre-melting, defects including lack of fusion, overflow and discontinuity of the deposited layers always occur. To assure the quality of GTA-assisted DDM parts, online monitoring based on visual sensing has been implemented. The study also addresses automated defect classification by means of a convolutional neural network-support vector machine (CNN-SVM), to avoid the low efficiency and bias of manual recognition. A best accuracy of 98.9%, with an execution time of about 12 milliseconds per image, shows that the model is suitable for real-time feedback control of the process.

Large-Scale Twin Parametric Support Vector Machine Using Pinball Loss Function

IEEE Transactions on Systems Man and Cybernetics Systems ◽

10.1109/tsmc.2019.2896642 ◽

2019 ◽

pp. 1-17 ◽

Cited By ~ 4

Author(s):

Sweta Sharma ◽

Reshma Rastogi ◽

Suresh Chandra

Keyword(s):

Support Vector Machine ◽

Loss Function ◽

Large Scale ◽

Support Vector ◽

Pinball Loss

Mie scattering and microparticle-based characterization of heavy metal ions and classification by statistical inference methods

Royal Society Open Science ◽

10.1098/rsos.190001 ◽

2019 ◽

Vol 6 (5) ◽

pp. 190001 ◽

Cited By ~ 1

Author(s):

Katherine E. Klug ◽

Christian M. Jennings ◽

Nicholas Lytal ◽

Lingling An ◽

Jeong-Yeol Yoon

Keyword(s):

Heavy Metal ◽

Support Vector Machine ◽

Metal Ions ◽

Heavy Metal Ions ◽

Mie Scattering ◽

Training Data ◽

Scattering Data ◽

Support Vector ◽

Linear Discriminant ◽

Machine Analysis

A straightforward method for classifying heavy metal ions in water is proposed using statistical classification and clustering techniques from non-specific microparticle scattering data. A set of carboxylated polystyrene microparticles of sizes 0.91, 0.75 and 0.40 µm was mixed with the solutions of nine heavy metal ions and two control cations, and scattering measurements were collected at two angles optimized for scattering from non-aggregated and aggregated particles. Classification of these observations was conducted and compared among several machine learning techniques, including linear discriminant analysis, support vector machine analysis, K-means clustering and K-medians clustering. This study found the highest classification accuracy using the linear discriminant and support vector machine analysis, each reporting high classification rates for heavy metal ions with respect to the model. This may be attributed to moderate correlation between detection angle and particle size. These classification models provide reasonable discrimination between most ion species, with the highest distinction seen for Pb(II), Cd(II), Ni(II) and Co(II), followed by Fe(II) and Fe(III), potentially due to its known sorption with carboxyl groups. The support vector machine analysis was also applied to three different mixture solutions representing leaching from pipes and mine tailings, and showed good correlation with single-species data, specifically with Pb(II) and Ni(II). With more expansive training data and further processing, this method shows promise for low-cost and portable heavy metal identification and sensing.

Maximum Variance Hashing via Column Generation

Mathematical Problems in Engineering ◽

10.1155/2013/379718 ◽

2013 ◽

Vol 2013 ◽

pp. 1-10

Author(s):

Lei Luo ◽

Chao Zhang ◽

Yongrui Qin ◽

Chunyuan Zhang

Keyword(s):

Column Generation ◽

Large Scale ◽

Web Search ◽

Nearest Neighbor ◽

Computational Cost ◽

Multimedia Retrieval ◽

Training Data ◽

Nonlinear Dimensionality Reduction ◽

Maximum Variance ◽

Data Volume

With the explosive growth of the data volume in modern applications such as web search and multimedia retrieval, hashing is becoming increasingly important for efficient nearest neighbor (similar item) search. Recently, a number of data-dependent methods have been developed, reflecting the great potential of learning for hashing. Inspired by the classic nonlinear dimensionality reduction algorithm—maximum variance unfolding, we propose a novel unsupervised hashing method, named maximum variance hashing, in this work. The idea is to maximize the total variance of the hash codes while preserving the local structure of the training data. To solve the derived optimization problem, we propose a column generation algorithm, which directly learns the binary-valued hash functions. We then extend it using anchor graphs to reduce the computational cost. Experiments on large-scale image datasets demonstrate that the proposed method outperforms state-of-the-art hashing methods in many cases.
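Why binary hash codes make nearest neighbor search cheap can be seen in a short sketch of Hamming-distance lookup. This does not implement maximum variance hashing or the column generation step: the codes are assumed to be given already (here as plain integers), and both function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")

def nearest_by_hamming(codes, query, k=1):
    """Return indices of the k stored codes closest to `query` in Hamming distance."""
    return sorted(range(len(codes)), key=lambda i: hamming(codes[i], query))[:k]
```

Each comparison is a single XOR followed by a bit count, so a linear scan over compact codes stays fast even when the original feature vectors are high-dimensional.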

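Analisis Sentimen Data Twitter Tentang Pasangan Capres-Cawapres Pemilu 2019 Dengan Metode Lexicon Based Dan Support Vector Machine [Sentiment Analysis of Twitter Data on the 2019 Election Presidential and Vice-Presidential Candidate Pairs Using Lexicon-Based and Support Vector Machine Methods]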

Jurnal Ilmiah FIFO ◽

10.22441/fifo.2019.v11i2.004 ◽

2019 ◽

Vol 11 (2) ◽

pp. 144

Author(s):

Danar Wido Seno ◽

Arief Wibowo

Keyword(s):

Social Media ◽

Support Vector Machine ◽

Sentiment Analysis ◽

Vice President ◽

Training Data ◽

Support Vector ◽

New Words ◽

Textual Data ◽

Data Content ◽

Combination Of Methods

The growth of written content on social media produces many new words and abbreviations on Twitter, which makes it increasingly difficult for sentiment analysis to achieve high accuracy on Twitter text. In this study, the authors analyzed sentiment toward the pairs of candidates for President and Vice President of Indonesia in the 2019 elections. To obtain higher accuracy and to accommodate the evolution of textual data on Twitter, the authors combined unsupervised and supervised methods, namely Lexicon Based and Support Vector Machine. The study used Twitter data from October 2018, collected with search keywords containing the names of each pair of candidates for President and Vice President in the 2019 elections, totaling 800 data points. With these 800 data points, the best accuracy obtained was 92.5%, using 80% of the data for training and 20% for testing, with per-class precision between 85.7% and 97.2% and per-class recall between 78.2% and 93.5%. With the Lexicon Based method used for labeling, the dataset for the Support Vector Machine no longer has to be labeled manually, and the lexicon's dictionary can be extended as data content on Twitter evolves.
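The labeling step and the 80/20 split reported above can be sketched as follows. The scoring rule, the three class names and the word lists passed in are hypothetical; the abstract does not describe the actual lexicon or how scores are mapped to classes, so this is only one plausible reading of lexicon-based labeling.

```python
def lexicon_label(tokens, positive_words, negative_words):
    """Label a tokenised tweet by counting lexicon hits (hypothetical rule)."""
    # Each positive hit adds 1 and each negative hit subtracts 1.
    score = sum((t in positive_words) - (t in negative_words) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def split_sizes(n_items, train_fraction):
    """Return (train_count, test_count) for a simple holdout split."""
    n_train = round(n_items * train_fraction)
    return n_train, n_items - n_train
```

For the 800 data points and the 80% training share mentioned in the abstract, `split_sizes(800, 0.8)` gives 640 training and 160 testing items. Labels produced this way would then serve as training targets for the supervised classifier.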

Structural Damage Detection Using Supervised Nonlinear Support Vector Machine

Journal of Composites Science ◽

10.3390/jcs5110303 ◽

2021 ◽

Vol 5 (11) ◽

pp. 303

Author(s):

Kian K. Sepahvand

Keyword(s):

Support Vector Machine ◽

Damage Detection ◽

Structural Damage ◽

Natural Frequencies ◽

Training Data ◽

Support Vector ◽

Lightweight Structures ◽

Straightforward Method ◽

Classification Boundary ◽

Nonlinear Support

Damage detection using vibrational properties, such as eigenfrequencies, is an efficient and straightforward method for detecting damage in structures, components, and machines. The method, however, is very inefficient when the natural frequencies of damaged and undamaged specimens differ only slightly. This is particularly the case with lightweight structures, such as fiber-reinforced composites. The nonlinear support vector machine (SVM) provides enhanced results under such conditions by transforming the original features into a new space or applying a kernel trick. In this work, the natural frequencies of damaged and undamaged components are used for classification, employing the nonlinear SVM. The proposed methodology assumes that the frequencies are identified sequentially from an experimental modal analysis; for the purposes of this study, however, the training data are generated from FEM simulations of damaged and undamaged samples. It is shown that a nonlinear SVM using a kernel function yields a clear classification boundary between damaged and undamaged specimens, even for minor variations in natural frequencies.
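The phrase "transforming the original features into a new space or applying a kernel trick" can be made concrete with a small sketch. The abstract does not say which kernel the paper uses, so the degree-2 polynomial and Gaussian (RBF) kernels below are illustrative choices, and all function names are hypothetical. The point is that the kernel value equals an inner product in a transformed feature space without that space ever being built.

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_features(x):
    """Explicit feature map whose inner product equals (x . y) ** 2 in 2-D."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def poly2_kernel(x, y):
    """Same inner product, computed without building the feature vectors."""
    return dot(x, y) ** 2

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
```

For any two 2-D vectors `x` and `y`, `poly2_kernel(x, y)` agrees with `dot(poly2_features(x), poly2_features(y))` up to floating-point rounding, which is the identity a kernel SVM relies on.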

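Analisis Sentimen Twitter untuk Teks Berbahasa Indonesia dengan Maximum Entropy dan Support Vector Machine [Twitter Sentiment Analysis for Indonesian-Language Text Using Maximum Entropy and Support Vector Machine]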

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.3499 ◽

2014 ◽

Vol 8 (1) ◽

pp. 91 ◽

Cited By ~ 5

Author(s):

Noviah Dwi Putranti ◽

Edi Winarko

Keyword(s):

Support Vector Machine ◽

Maximum Entropy ◽

Social Networking Site ◽

Training Data ◽

Classification Model ◽

Support Vector ◽

Public Sentiment ◽

Pos Tagger ◽

Negative Sentiment ◽

Bahasa Indonesia

Sentiment analysis in this research is the process of classifying textual documents into two classes, positive and negative sentiment. Opinion data were obtained from the social networking site Twitter using queries in Bahasa Indonesia. The research aims to determine public sentiment toward a particular object as expressed on Twitter in Indonesian, thereby supporting market research on public opinion. The collected data undergo preprocessing and POS tagging to produce a classification model through a training process. Sentiment-bearing words were collected with a dictionary-based approach, which yielded 18,069 words in this study. The Maximum Entropy algorithm is used for the POS tagger, and the algorithm used to build the classification model from the training data is Support Vector Machine. The features used are unigrams with TF-IDF weighting. The classification achieved an accuracy of 86.81% under 7-fold cross-validation with the sigmoid kernel. Manual class labeling with the POS tagger yielded an accuracy of 81.67%. Keywords: sentiment analysis, classification, maximum entropy POS tagger, support vector machine, Twitter.
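The 7-fold cross-validation protocol mentioned above can be sketched as an index generator. The abstract does not say whether the folds were shuffled or stratified, so the contiguous, unshuffled folds below are an assumption, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

Each sample is used for testing exactly once, and the reported accuracy would normally be the average over the seven test folds.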
