Matrix sketching for supervised classification with imbalanced classes

The presence of imbalanced classes is increasingly common in practical applications and is known to heavily compromise the learning process. In this paper we propose a new method aimed at addressing this issue in binary supervised classification. Re-balancing the class sizes has turned out to be a fruitful strategy to overcome this problem. Our proposal performs re-balancing through matrix sketching. Matrix sketching is a recently developed data compression technique that is characterized by the property of preserving most of the linear information that is present in the data. This property is guaranteed by the Johnson-Lindenstrauss lemma (1984), which allows an n-dimensional space to be embedded into a reduced one without distorting, within an $\epsilon$-size interval, the distances between any pair of points. We propose to use matrix sketching as an alternative to the standard re-balancing strategies that are based on randomly under-sampling the majority class or randomly over-sampling the minority one. We assess the properties of our method when combined with linear discriminant analysis (LDA), classification trees (C4.5) and Support Vector Machines (SVM) on simulated and real data. Results show that sketching can represent a sound alternative to the most widely used re-balancing methods.
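As a rough illustration of the re-balancing idea (not the authors' implementation), the majority class matrix can be multiplied by a k x n Gaussian random matrix with entries drawn from N(0, 1/k), where k is the minority class size, so that both classes end up with k rows. The function names, the choice of a Gaussian sketch, and the toy data are assumptions made for this sketch.

```python
import random

def gaussian_sketch(rows, k, rng):
    """Compress an n x p matrix (list of rows) to k x p via S @ X, with S_ij ~ N(0, 1/k)."""
    n, p = len(rows), len(rows[0])
    scale = (1.0 / k) ** 0.5  # standard deviation of each entry of S
    sketched = []
    for _ in range(k):
        s = [rng.gauss(0.0, scale) for _ in range(n)]  # one row of S
        sketched.append([sum(s[i] * rows[i][j] for i in range(n)) for j in range(p)])
    return sketched

def rebalance_by_sketching(X, y, majority_label, rng):
    """Sketch the majority class down to the minority class size (binary labels assumed)."""
    majority = [x for x, label in zip(X, y) if label == majority_label]
    minority = [(x, label) for x, label in zip(X, y) if label != majority_label]
    k = len(minority)
    X_bal = gaussian_sketch(majority, k, rng) + [x for x, _ in minority]
    y_bal = [majority_label] * k + [label for _, label in minority]
    return X_bal, y_bal

# Toy data (hypothetical): 300 majority rows, 30 minority rows, 4 features.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(300)]
X += [[rng.gauss(2.0, 1.0) for _ in range(4)] for _ in range(30)]
y = [0] * 300 + [1] * 30

X_bal, y_bal = rebalance_by_sketching(X, y, majority_label=0, rng=rng)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

With this scaling the expected value of SᵀS is the identity, so the sketched matrix preserves XᵀX in expectation. The sketched rows are linear combinations of the original observations rather than a subsample of them.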

Multiscale Supervised Classification of Point Clouds with Urban and Forest Applications

Sensors ◽

10.3390/s19204523 ◽

2019 ◽

Vol 19 (20) ◽

pp. 4523 ◽

Cited By ~ 1

Author(s):

Carlos Cabo ◽

Celestino Ordóñez ◽

Fernando Sánchez-Lasheras ◽

Javier Roca-Pardiñas ◽

and Javier de Cos-Juez

Keyword(s):

Random Forest ◽

Laser Scanning ◽

Supervised Classification ◽

Computing Time ◽

Principal Component ◽

Point Clouds ◽

Support Vector ◽

Linear Discriminant ◽

Vector Machines ◽

Input Variables

We analyze the utility of multiscale supervised classification algorithms for object detection and extraction from laser scanning or photogrammetric point clouds. Only the geometric information (the point coordinates) was considered, thus making the method independent of the systems used to collect the data. A maximum of five features (input variables) was used, four of them related to the eigenvalues obtained from a principal component analysis (PCA). PCA was carried out at six scales, defined by the diameter of a sphere around each observation. Four multiclass supervised classification models were tested (linear discriminant analysis, logistic regression, support vector machines, and random forest) in two different scenarios, urban and forest, formed by artificial and natural objects, respectively. The results obtained were accurate (overall accuracy over 80% for the urban dataset, and over 93% for the forest dataset), in the range of the best results found in the literature, regardless of the classification method. For both datasets, the random forest algorithm provided the best solution/results when discrimination capacity, computing time, and the ability to estimate the relative importance of each variable are considered together.
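The eigenvalue-based features can be sketched as follows. The abstract does not name the four features, so the three used here (linearity, planarity, sphericity) are common choices taken as assumptions, as are the function names and the brute-force neighbour search. Each scale is a sphere diameter, and the covariance eigenvalues of the points inside the sphere are computed with the closed-form trigonometric method for symmetric 3x3 matrices.

```python
import math

def sym3_eigenvalues(a):
    """Eigenvalues (descending) of a symmetric 3x3 matrix, trigonometric closed form."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # matrix is already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det_b / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]

def covariance(points):
    """Population covariance matrix of a list of 3D points."""
    n = len(points)
    mean = [sum(pt[d] for pt in points) / n for d in range(3)]
    return [[sum((pt[i] - mean[i]) * (pt[j] - mean[j]) for pt in points) / n
             for j in range(3)] for i in range(3)]

def multiscale_features(points, diameters):
    """For each point and each scale: [linearity, planarity, sphericity]."""
    features = []
    for center in points:
        row = []
        for diam in diameters:
            # Brute-force neighbourhood: all points inside the sphere of this diameter.
            nbrs = [pt for pt in points if math.dist(pt, center) <= diam / 2.0]
            l1, l2, l3 = (max(v, 0.0) for v in sym3_eigenvalues(covariance(nbrs)))
            if l1 == 0.0:  # degenerate neighbourhood (e.g. a single point)
                row += [0.0, 0.0, 0.0]
            else:
                row += [(l1 - l2) / l1, (l2 - l3) / l1, l3 / l1]
        features.append(row)
    return features

# Toy cloud (hypothetical): points on a straight line should have linearity close to 1.
line = [(0.1 * i, 0.0, 0.0) for i in range(20)]
feats = multiscale_features(line, diameters=[0.5, 1.0])
print(feats[10])
```

The resulting rows (three values per scale) could then be passed to any of the classifiers named in the abstract. A spatial index would replace the brute-force search for realistic cloud sizes.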

Generalized Discriminant Analysis Using a Kernel Approach

Neural Computation ◽

10.1162/089976600300014980 ◽

2000 ◽

Vol 12 (10) ◽

pp. 2385-2404 ◽

Cited By ~ 1145

Author(s):

G. Baudat ◽

F. Anouar

Keyword(s):

Discriminant Analysis ◽

Simulated Data ◽

Feature Space ◽

Real Data ◽

Decision Function ◽

Support Vector ◽

Linear Discriminant ◽

Kernel Approach ◽

Generalized Discriminant Analysis ◽

Nonlinear Discriminant Analysis

We present a new method that we call generalized discriminant analysis (GDA) to deal with nonlinear discriminant analysis using a kernel function operator. The underlying theory is close to that of support vector machines (SVM) insofar as the GDA method provides a mapping of the input vectors into a high-dimensional feature space. In the transformed space, linear properties make it easy to extend and generalize the classical linear discriminant analysis (LDA) to nonlinear discriminant analysis. The formulation is expressed as an eigenvalue problem resolution. Using a different kernel, one can cover a wide class of nonlinearities. For both simulated data and alternate kernels, we give classification results, as well as the shape of the decision function. The results are confirmed using real data to perform seed classification.

Recurrent Kernel Machines: Computing with Infinite Echo State Networks

Neural Computation ◽

10.1162/neco_a_00200 ◽

2012 ◽

Vol 24 (1) ◽

pp. 104-133 ◽

Cited By ~ 42

Author(s):

Michiel Hermans ◽

Benjamin Schrauwen

Keyword(s):

Neural Networks ◽

Recurrent Neural Networks ◽

Input Data ◽

Dimensional Space ◽

Theoretical Research ◽

High Dimensional ◽

Support Vector ◽

Practical Applications ◽

Echo State Networks ◽

Vector Machines

Echo state networks (ESNs) are large, random recurrent neural networks with a single trained linear readout layer. Despite the untrained nature of the recurrent weights, they are capable of performing universal computations on temporal input data, which makes them interesting for both theoretical research and practical applications. The key to their success lies in the fact that the network computes a broad set of nonlinear, spatiotemporal mappings of the input data, on which linear regression or classification can easily be performed. One could consider the reservoir as a spatiotemporal kernel, in which the mapping to a high-dimensional space is computed explicitly. In this letter, we build on this idea and extend the concept of ESNs to infinite-sized recurrent neural networks, which can be considered recursive kernels that subsequently can be used to create recursive support vector machines. We present the theoretical framework, provide several practical examples of recursive kernels, and apply them to typical temporal tasks.

Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests

BMC Research Notes ◽

10.1186/1756-0500-4-299 ◽

2011 ◽

Vol 4 (1) ◽

Cited By ~ 149

Author(s):

João Maroco ◽

Dina Silva ◽

Ana Rodrigues ◽

Manuela Guerreiro ◽

Isabel Santana ◽

...

Keyword(s):

Data Mining ◽

Neural Networks ◽

Logistic Regression ◽

Random Forests ◽

Real Data ◽

Support Vector ◽

Linear Discriminant ◽

Data Comparison ◽

Vector Machines ◽

Mining Methods

Hybrid approaches to feature subset selection for data classification in high-dimensional feature space

Artificial Intelligence Research ◽

10.5430/air.v9n1p45 ◽

2020 ◽

Vol 9 (1) ◽

pp. 45

Author(s):

Maysa Ibrahem Almulla Khalaf ◽

John Q Gan

Keyword(s):

Dimensional Space ◽

Subset Selection ◽

Feature Space ◽

Feature Subset Selection ◽

High Dimensional ◽

Support Vector ◽

Feature Subset ◽

Linear Discriminant ◽

Hybrid Approaches ◽

Low Dimensional

This paper proposes two hybrid feature subset selection approaches based on the combination (union or intersection) of both supervised and unsupervised filter approaches before using a wrapper, aiming to obtain low-dimensional features with high accuracy and interpretability and low time consumption. Experiments with the proposed hybrid approaches have been conducted on seven high-dimensional feature datasets. The classifiers adopted are support vector machine (SVM), linear discriminant analysis (LDA), and K-nearest neighbour (KNN). Experimental results have demonstrated the advantages and usefulness of the proposed methods in feature subset selection in high-dimensional space in terms of the number of selected features and time spent to achieve the best classification accuracy.
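The union/intersection step can be illustrated with a small sketch. The abstract does not say which filters are combined, so absolute Pearson correlation with the label (supervised) and per-feature variance (unsupervised) are stand-ins chosen here as assumptions, and the wrapper stage that follows in the paper is omitted. All function names and the toy data are hypothetical.

```python
import math

def variance_scores(X):
    """Unsupervised filter: variance of each feature."""
    n = len(X)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        m = sum(col) / n
        scores.append(sum((v - m) ** 2 for v in col) / n)
    return scores

def correlation_scores(X, y):
    """Supervised filter: absolute Pearson correlation of each feature with the label."""
    n = len(X)
    my = sum(y) / n
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        m = sum(col) / n
        sx = math.sqrt(sum((v - m) ** 2 for v in col))
        cov = sum((a - m) * (b - my) for a, b in zip(col, y))
        scores.append(abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])

def hybrid_filter(X, y, k, mode="union"):
    """Combine the two filter selections before handing them to a wrapper."""
    supervised = top_k(correlation_scores(X, y), k)
    unsupervised = top_k(variance_scores(X), k)
    return supervised | unsupervised if mode == "union" else supervised & unsupervised

# Toy data (hypothetical): 6 samples, 4 features.
X = [[0.1,  5.0, 0.0, 1.0],
     [0.2, -5.0, 0.0, 1.0],
     [0.1,  5.0, 0.0, 1.0],
     [0.9, -5.0, 8.0, 1.0],
     [1.0,  5.0, 8.0, 1.0],
     [0.9, -5.0, 8.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]

print(sorted(hybrid_filter(X, y, k=2, mode="union")))
print(sorted(hybrid_filter(X, y, k=2, mode="intersection")))
```

The union keeps any feature that either filter favours, which gives the wrapper a larger candidate set. The intersection keeps only features that both filters agree on, which gives a smaller set and a cheaper wrapper search.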

A Data-Driven Fault Diagnosis Method for Railway Turnouts

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198119837222 ◽

2019 ◽

Vol 2673 (4) ◽

pp. 448-457 ◽

Cited By ~ 9

Author(s):

Dongxiu Ou ◽

Rui Xue ◽

Ke Cui

Keyword(s):

Fault Diagnosis ◽

Imbalanced Data ◽

Principal Component ◽

Real Data ◽

Feature Reduction ◽

Data Driven ◽

Support Vector ◽

Linear Discriminant ◽

Railway Line ◽

Diagnosis Method

Turnout systems on railways are crucial for safety protection and improvements in efficiency. The statistics show that the most common faults in railway systems are turnout system faults. Therefore, many railway systems have adopted the microcomputer monitoring system (MMS) to monitor their health and performance in real time. However, in practice, existing turnout fault diagnosis methods depend largely on human experience. In this paper, we propose a data-driven fault diagnosis method that monitors data from point machines collected using MMS. First, based on a derivative method, data features are extracted by segmenting the original sample. Then, we apply two methods for feature reduction: principal component analysis (PCA) and linear discriminant analysis (LDA). The results show that LDA gave a better performance in the cases studied. A problem that cannot be overlooked is that the imbalance between rare fault samples and abundant normal samples will reduce the accuracy of classic fault diagnosis models. To deal with this problem of imbalanced data, we propose a modified support vector machine (SVM) method. Finally, an experiment using real data collected from the Guangzhou Railway Line is presented, which demonstrates that our method is reliable and feasible in fault diagnosis. It can further assist engineers in performing timely repairs and maintenance work in the future.
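The abstract does not describe how the SVM is modified. One widely used way of making an SVM sensitive to rare classes is to weight each class inversely to its frequency in the hinge loss, and the sketch below illustrates only that general idea, under the assumption of +1 / -1 labels. It should not be read as the authors' method, and the function names are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (number_of_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_hinge_loss(scores, labels, weights):
    """Mean class-weighted hinge loss; labels are +1 / -1, scores are decision values."""
    total = sum(weights[t] * max(0.0, 1.0 - t * s) for s, t in zip(scores, labels))
    return total / len(labels)

# Toy example (hypothetical): 95 normal samples (-1) and 5 fault samples (+1).
labels = [-1] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)

# A classifier that outputs 0 everywhere is penalised equally on both classes overall.
print(weighted_hinge_loss([0.0] * 100, labels, weights))
```

With these weights a misclassified fault sample costs 19 times as much as a misclassified normal sample, so the decision boundary is no longer dominated by the abundant class.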

Human motion recognition based on SVM in VR art media interaction environment

Human-centric Computing and Information Sciences ◽

10.1186/s13673-019-0203-8 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 7

Author(s):

Fuquan Zhang ◽

Tsu-Yang Wu ◽

Jeng-Shyang Pan ◽

Gangyi Ding ◽

Zuoyong Li

Keyword(s):

Genetic Algorithm ◽

Dimensional Space ◽

Dimensional Subspace ◽

Recognition Algorithm ◽

Human Motion ◽

Support Vector ◽

Motion Recognition ◽

Linear Discriminant ◽

Classification Feature ◽

Human Motion Recognition

In order to solve the problem of human motion recognition in multimedia interaction scenarios in a virtual reality environment, a motion classification and recognition algorithm based on linear decision and support vector machine (SVM) is proposed. First, the kernel function is introduced into the linear discriminant analysis for nonlinear projection to map the training samples into a high-dimensional subspace to obtain the best classification feature vector, which effectively solves the nonlinear problem and expands the sample difference. The genetic algorithm is used to realize the parameter search optimization of SVM, which makes full use of the advantages of the genetic algorithm in multi-dimensional space optimization. The test results show that, compared with other classification recognition algorithms, the proposed method has a good classification effect on multiple performance indicators of human motion recognition and has higher recognition accuracy and better robustness.

Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach

Electronics ◽

10.3390/electronics10232950 ◽

2021 ◽

Vol 10 (23) ◽

pp. 2950

Author(s):

Marián Trnka ◽

Sakhia Darjaa ◽

Marian Ritomský ◽

Róbert Sabo ◽

Milan Rusko ◽

...

Keyword(s):

Dimensional Space ◽

Speech Sound ◽

Semantic Content ◽

Training Data ◽

Support Vector ◽

Circumplex Model ◽

Practical Applications ◽

Additional Information ◽

Unseen Data ◽

Data Clusters

A frequently used procedure to examine the relationship between categorical and dimensional descriptions of emotions is to ask subjects to place verbal expressions representing emotions in a continuous multidimensional emotional space. This work chooses a different approach. It aims at creating a system predicting the values of Activation and Valence (AV) directly from the sound of emotional speech utterances without the use of its semantic content or any other additional information. The system uses X-vectors to represent sound characteristics of the utterance and a Support Vector Regressor for the estimation of the AV values. The system is trained on a pool of three publicly available databases with dimensional annotation of emotions. The quality of regression is evaluated on the test sets of the same databases. Mapping of categorical emotions to the dimensional space is tested on another pool of eight categorically annotated databases. The aim of the work was to test whether in each unseen database the predicted values of Valence and Activation will place emotion-tagged utterances in the AV space in accordance with expectations based on Russell’s circumplex model of affective space. Due to the great variability of speech data, clusters of emotions create overlapping clouds. Their average location can be represented by centroids. A hypothesis on the position of these centroids is formulated and evaluated. The system’s ability to separate the emotions is evaluated by measuring the distance of the centroids. It can be concluded that the system works as expected and the positions of the clusters follow the hypothesized rules. Although the variance in individual measurements is still very high and the overlap of emotion clusters is large, it can be stated that the AV coordinates predicted by the system lead to an observable separation of the emotions in accordance with the hypothesis. Knowledge from training databases can therefore be used to predict AV coordinates of unseen data of various origins. This could be used to detect high levels of stress or depression. With the appearance of more dimensionally annotated training data, the systems predicting emotional dimensions from speech sound will become more robust and usable in practical applications in call-centers, avatars, robots, information-providing systems, security applications, and the like.
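The centroid-based evaluation described above can be sketched in a few lines. The (activation, valence) values and emotion labels below are invented for illustration and are not taken from the paper, and the function names are hypothetical. The sketch computes one centroid per emotion and the Euclidean distance between each pair of centroids.

```python
import math
from collections import defaultdict

def centroids(points, labels):
    """Mean (activation, valence) position of each emotion category."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {label: tuple(sum(c) / len(pts) for c in zip(*pts))
            for label, pts in groups.items()}

def centroid_distances(cents):
    """Euclidean distance between every pair of centroids."""
    names = sorted(cents)
    return {(a, b): math.dist(cents[a], cents[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical predicted (activation, valence) pairs for emotion-tagged utterances.
points = [(0.8, -0.6), (0.6, -0.8), (0.7, 0.7), (0.5, 0.9), (-0.6, -0.5), (-0.8, -0.7)]
labels = ["anger", "anger", "joy", "joy", "sadness", "sadness"]

cents = centroids(points, labels)
dists = centroid_distances(cents)
for pair, d in dists.items():
    print(pair, round(d, 3))
```

Larger distances between centroids indicate better separation of the corresponding emotions, although overlapping clusters can still share a large part of the AV space.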

ECG Signal Classification using Support Vector Machine and Linear Discriminant Analysis

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i5.17201725 ◽

2019 ◽

Vol 7 (5) ◽

pp. 1720-1725

Author(s):

S. Grover ◽

Shailja .

Keyword(s):

Support Vector Machine ◽

Discriminant Analysis ◽

Linear Discriminant Analysis ◽

Signal Classification ◽

Support Vector ◽

Ecg Signal ◽

Linear Discriminant

Application of Machine Learning Approaches for the Design and Study of Anticancer Drugs

Current Drug Targets ◽

10.2174/1389450119666180809122244 ◽

2019 ◽

Vol 20 (5) ◽

pp. 488-500 ◽

Cited By ~ 6

Author(s):

Yan Hu ◽

Yi Lu ◽

Shuo Wang ◽

Mengying Zhang ◽

Xiaosheng Qu ◽

...

Keyword(s):

Machine Learning ◽

Drug Design ◽

Anticancer Drugs ◽

Nearest Neighbor ◽

Cost Effective ◽

Support Vector ◽

Learning Approaches ◽

K Nearest Neighbor ◽

Activity Prediction ◽

Linear Discriminant

Background: Globally, the number of cancer patients and deaths continues to increase yearly, and cancer has therefore become one of the world's leading causes of morbidity and mortality. In recent years, the study of anticancer drugs has become one of the most popular medical topics. Objective: In this review, in order to study the application of machine learning in predicting anticancer drug activity, some machine learning approaches such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB) were selected, and examples of their applications in anticancer drug design are listed. Results: Machine learning contributes a lot to anticancer drug design and helps researchers by saving time and cost. However, it can only be an assisting tool for drug design. Conclusion: This paper introduces the application of machine learning approaches in anticancer drug design. Many examples of success in identification and prediction in the area of anticancer drug activity prediction are discussed, and anticancer drug research is still in active progress. Moreover, the merits of some web servers related to anticancer drugs are mentioned.
