A New Method for Solving Supervised Data Classification Problems

Supervised data classification is one of the techniques used to extract nontrivial information from data. Classification is widely used in various fields, including data mining, industry, medicine, science, and law. This paper considers a new algorithm for supervised data classification problems associated with cluster analysis. The mathematical formulation of this algorithm is based on nonsmooth, nonconvex optimization. A new algorithm for solving this optimization problem is utilized; it uses a derivative-free technique that is robust and efficient. To improve classification performance and efficiency in generating the classification model, a new feature selection algorithm based on techniques of convex programming is suggested. The proposed methods are tested on real-world datasets. Results of numerical experiments are presented that demonstrate the effectiveness of the proposed algorithms.

A new feature selection algorithm for two-class classification problems and application to endometrial cancer

2012 IEEE 51st IEEE Conference on Decision and Control (CDC) ◽

10.1109/cdc.2012.6426819 ◽

2012 ◽

Cited By ~ 10

Author(s):

M. Eren Ahsen ◽

Nitin K. Singh ◽

Todd Boren ◽

M. Vidyasagar ◽

Michael A. White

Keyword(s):

Feature Selection ◽

Endometrial Cancer ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Classification Problems ◽

New Feature

A Novel Model for Imbalanced Data Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6145 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6680-6687

Author(s):

Jian Yin ◽

Chunjing Gan ◽

Kaiqi Zhao ◽

Xuan Lin ◽

Zhe Quan ◽

...

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Classification Performance ◽

Classification Model ◽

Proposed Model ◽

Imbalanced Data Classification ◽

Public Datasets ◽

Distribution Cost ◽

Novel Model ◽

Learning Data

Recently, imbalanced data classification has received much attention due to its wide range of applications. In the literature, existing research has attempted to improve classification performance by considering various factors such as the imbalanced distribution, cost-sensitive learning, data space improvement, and ensemble learning. Nevertheless, most existing methods focus on only some of these factors. In this work, we propose a novel imbalanced data classification model that considers all of them. To evaluate the performance of our proposed model, we conducted experiments on 14 public datasets. The results show that our model outperforms the state-of-the-art methods in terms of recall, G-mean, F-measure and AUC.
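The metrics named in this abstract can be illustrated with a minimal sketch. It assumes the common definitions (G-mean as the geometric mean of recall and specificity, F-measure as F1); the abstract does not state which variants the authors used, and the function name `imbalance_metrics` and the example counts are hypothetical. AUC is omitted because it needs ranked scores rather than confusion counts.

```python
import math

def imbalance_metrics(tp, fp, tn, fn):
    """Compute common imbalance-aware metrics from binary confusion counts.

    Assumed definitions (not taken from the paper):
      G-mean = sqrt(recall * specificity), F-measure = F1.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(recall * specificity)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "g_mean": g_mean,
        "f1": f1,
        "accuracy": accuracy,
    }

# Hypothetical example: 10 minority samples, 100 majority samples.
metrics = imbalance_metrics(tp=8, fp=10, tn=90, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this made-up example accuracy is about 0.89 while F1 is only about 0.57, which is why accuracy alone is usually considered a poor summary on imbalanced data.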

A new feature selection algorithm for stream Data Classification

2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI) ◽

10.1109/icacci.2013.6637462 ◽

2013 ◽

Cited By ~ 2

Author(s):

Kapil Wankhade ◽

Dhiraj Rane ◽

Ravindra Thool

Keyword(s):

Feature Selection ◽

Data Classification ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Stream Data ◽

New Feature

Feature Selection with Conditional Mutual Information Considering Feature Interaction

Symmetry ◽

10.3390/sym11070858 ◽

2019 ◽

Vol 11 (7) ◽

pp. 858 ◽

Cited By ~ 3

Author(s):

Jun Liang ◽

Liang Hou ◽

Zhenhua Luan ◽

Weiping Huang

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Classification Performance ◽

Feature Interaction ◽

Conditional Mutual Information ◽

Selection Algorithm ◽

Benchmark Datasets ◽

Feature Relevance ◽

New Feature ◽

Selection Algorithms

Feature interaction is a newly proposed feature relevance relationship, and the unintentional removal of interactive features can result in poor classification performance. However, traditional feature selection algorithms mainly focus on detecting relevant and redundant features, while interactive features are usually ignored. To deal with this problem, feature relevance, feature redundancy and feature interaction are redefined based on information theory. Then a new feature selection algorithm named CMIFSI (Conditional Mutual Information based Feature Selection considering Interaction) is proposed in this paper, which makes use of conditional mutual information to estimate feature redundancy and interaction. To verify the effectiveness of our algorithm, empirical experiments are conducted to compare it with several other representative feature selection algorithms. The results on both synthetic and benchmark datasets indicate that our algorithm achieves better results than other methods in most cases. Further, they highlight the necessity of dealing with feature interaction.
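To make the notion of feature interaction concrete, the sketch below estimates mutual information and conditional mutual information from discrete samples with a simple plug-in (frequency-count) estimator. It is not the CMIFSI criterion itself, which the abstract does not spell out; the function names and the XOR toy data are assumptions chosen for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    c_xy, c_x, c_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (c_x[x] * c_y[y]))
        for (x, y), c in c_xy.items()
    )

def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X;Y|Z) in bits from discrete samples."""
    n = len(xs)
    c_xyz = Counter(zip(xs, ys, zs))
    c_xz = Counter(zip(xs, zs))
    c_yz = Counter(zip(ys, zs))
    c_z = Counter(zs)
    return sum(
        (c / n) * math.log2(c * c_z[z] / (c_xz[(x, z)] * c_yz[(y, z)]))
        for (x, y, z), c in c_xyz.items()
    )

# Toy interaction: the label is the XOR of two binary features.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
label = [a ^ b for a, b in zip(f1, f2)]

print(mutual_information(f1, label))                  # f1 alone looks irrelevant
print(conditional_mutual_information(f1, label, f2))  # informative once f2 is known
```

In the XOR case each feature carries no information about the label on its own, yet conditioning on the other feature reveals a full bit. A selector that scores features only individually would discard both, which is the failure mode the abstract describes.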

Application of Data Denoising and Classification Algorithm Based on RPCA and Multigroup Random Walk Random Forest in Engineering

Mathematical Problems in Engineering ◽

10.1155/2019/7387398 ◽

2019 ◽

Vol 2019 ◽

pp. 1-15

Author(s):

Renchao Wang ◽

Yanlei Wang ◽

Yuming Ma

Keyword(s):

Random Walk ◽

Random Forest ◽

Gaussian Noise ◽

Hybrid Algorithm ◽

Principal Component ◽

Data Classification ◽

Classification Performance ◽

Classification Algorithms ◽

Classification Problems ◽

Non Gaussian

Data classification algorithms are often used in engineering, but data measured in actual engineering practice often contain noise of different types and degrees, such as vibration noise caused by water flow when measuring the natural frequencies of aqueducts or other hydraulic structures, which affects classification accuracy. In reality, such noise often appears disorganized and stochastic, and some existing algorithms perform poorly in the face of this non-Gaussian noise. Classification algorithms that perform well under these conditions are therefore needed. To address this issue, a hybrid algorithm that combines robust principal component analysis (RPCA) with multigroup random walk random forest (MRWRF) is proposed in this paper. RPCA can effectively remove part of the non-Gaussian noise, while MRWRF can select a better number of decision trees (DTs), which effectively improves the robustness and classification performance of random forest (RF). The combination of RPCA and MRWRF can effectively classify data with non-Gaussian noise. Compared with other existing algorithms, this hybrid algorithm has strong robustness and better classification performance and can thus provide a new approach for data classification problems in engineering.

BHHO-TVS: A Binary Harris Hawks Optimizer with Time-Varying Scheme for Solving Data Classification Problems

Applied Sciences ◽

10.3390/app11146516 ◽

2021 ◽

Vol 11 (14) ◽

pp. 6516

Author(s):

Hamouda Chantar ◽

Thaer Thaher ◽

Hamza Turabieh ◽

Majdi Mafarja ◽

Alaa Sheta

Keyword(s):

Feature Selection ◽

Search Algorithm ◽

Gravitational Search Algorithm ◽

Data Classification ◽

Classification Model ◽

Model Complexity ◽

Binary Particle Swarm Optimization ◽

Time Varying ◽

Classification Problems ◽

Whale Optimization

Data classification is a challenging problem that is very sensitive to noise and to the high dimensionality of the data. Reducing model complexity can help to improve the accuracy of the classification model. Therefore, in this research, we propose a novel feature selection technique based on Binary Harris Hawks Optimizer with Time-Varying Scheme (BHHO-TVS). The proposed BHHO-TVS adopts a time-varying transfer function that is applied to leverage the influence of the location vector to balance the exploration and exploitation power of the HHO. Eighteen well-known datasets provided by the UCI repository were utilized to show the significance of the proposed approach. The reported results show that BHHO-TVS outperforms BHHO with traditional binarization schemes as well as other binary feature selection methods such as binary gravitational search algorithm (BGSA), binary particle swarm optimization (BPSO), binary bat algorithm (BBA), binary whale optimization algorithm (BWOA), and binary salp swarm algorithm (BSSA). Compared with other similar feature selection approaches introduced in previous studies, the proposed method achieves the best accuracy rates on 67% of datasets.
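The idea of a time-varying transfer function can be sketched as follows. The abstract does not give the exact function used in BHHO-TVS, so this example assumes a sigmoid whose control parameter `tau` shrinks linearly over the iterations; the names `time_varying_sigmoid` and `binarize` and the default `tau_max` and `tau_min` values are hypothetical.

```python
import math
import random

def time_varying_sigmoid(x, t, max_iter, tau_max=4.0, tau_min=0.01):
    """Probability of selecting a feature given a continuous position x.

    Assumption: tau decreases linearly from tau_max (flat curve, more
    exploration) to tau_min (steep curve, more exploitation).
    """
    tau = tau_max - (tau_max - tau_min) * (t / max_iter)
    z = max(-60.0, min(60.0, x / tau))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def binarize(position, t, max_iter, rng):
    """Map a continuous position vector to a 0/1 feature mask."""
    return [
        1 if rng.random() < time_varying_sigmoid(x, t, max_iter) else 0
        for x in position
    ]

rng = random.Random(0)
position = [1.0, -1.0, 0.2, -0.2]
for t in (0, 50, 100):
    probs = [round(time_varying_sigmoid(x, t, 100), 3) for x in position]
    print(t, probs, binarize(position, t, 100, rng))
```

Early in the run the probabilities stay close to 0.5, so bits flip almost at random; late in the run the curve approaches a step function and the mask follows the sign of the position.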

Learn the Highest Label and Rest Label Description Degrees

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/426 ◽

2021 ◽

Author(s):

Jing Wang ◽

Xin Geng

Keyword(s):

Theoretical Analysis ◽

Real World ◽

Classification Performance ◽

Experimental Results ◽

Classification Problems ◽

Large Margin ◽

Performance Deterioration ◽

Real World Datasets ◽

New Perspective ◽

Label Distribution

Although Label Distribution Learning (LDL) has found wide applications in a variety of classification problems, it may face the challenge of objective mismatch: LDL neglects the optimal label for the sake of learning the whole label distribution, which leads to performance deterioration. To improve classification performance and solve the objective mismatch, we propose a new LDL algorithm called LDL-HR. LDL-HR provides a new perspective of label distribution, i.e., a combination of the highest label and the rest label description degrees. It works as follows. First, we learn the highest label by fitting the degenerated label distribution and large margin. Second, we learn the rest label description degrees to exploit generalization. Theoretical analysis shows the generalization of LDL-HR. Besides, the experimental results on 18 real-world datasets validate the statistical superiority of our method.

An Improved Convolutional Neural Network for Text Classification

Journal of Physics Conference Series ◽

10.1088/1742-6596/2066/1/012091 ◽

2021 ◽

Vol 2066 (1) ◽

pp. 012091

Author(s):

Xiaojing Fan ◽

A Runa ◽

Zhili Pei ◽

Mingyang Jiang

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Text Classification ◽

Classification Accuracy ◽

Classification Performance ◽

Classification Model ◽

Classification Problems ◽

Training Time ◽

Accuracy And Precision ◽

And Training

This paper studies text classification based on deep learning. To address the overfitting and long training time of the CNN text classification model, an SDCNN model is constructed based on a sparse dropout convolutional neural network. Experimental results show that, compared with CNN, SDCNN further improves the classification performance of the model, and its classification accuracy and precision can reach 98.96% and 85.61%, respectively, indicating that SDCNN has more advantages in text classification problems.

Transformer Oil Quality Assessment Using Random Forest with Feature Engineering

Energies ◽

10.3390/en14071809 ◽

2021 ◽

Vol 14 (7) ◽

pp. 1809

Author(s):

Mohammed El Amine Senoussaoui ◽

Mostefa Brahami ◽

Issouf Fofana

Keyword(s):

Machine Learning ◽

Random Forest ◽

Oil Quality ◽

Principal Component ◽

Condition Assessment ◽

Classification Performance ◽

Transformer Oil ◽

Classification Model ◽

Insulation Degradation ◽

Transformer Oils

Machine learning is widely used as a panacea in many engineering applications, including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on some aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: the oils that can be maintained in service, the oils that should be reconditioned or filtered, the oils that should be reclaimed, and the oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. Good performance was achieved through not only the application of the proposed algorithm but also the approach to data preprocessing. Before being fed to the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were then transformed again using principal component analysis and passed through the CFsSubset filter once more. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the decrease in the number of datasets required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.
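The correlation-based feature selection step mentioned above scores a feature subset by rewarding correlation with the target and penalizing correlation among the features. The sketch below shows the standard CFS merit heuristic, merit = k * r_cf / sqrt(k + k(k-1) * r_ff). Using absolute Pearson correlation as the association measure is an assumption made here for simplicity; the abstract does not say which measure the authors' tool used, and the function names and toy data are hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0

def cfs_merit(features, target):
    """CFS merit of a feature subset (list of columns) against a target."""
    k = len(features)
    # Mean absolute feature-target correlation.
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    # Mean absolute feature-feature correlation over all pairs.
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

target = [1, 2, 3, 4, 5]
informative = [1, 2, 3, 4, 5]   # perfectly correlated with the target
duplicate = [2, 4, 6, 8, 10]    # redundant copy of the informative feature
unrelated = [2, -1, 0, -1, 2]   # uncorrelated with the target

print(cfs_merit([informative], target))
print(cfs_merit([informative, duplicate], target))
print(cfs_merit([informative, unrelated], target))
```

A redundant copy does not raise the merit and an uncorrelated feature lowers it, so a search over subsets guided by this score tends to keep small, non-redundant feature sets.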

A lazy feature selection method for multi-label classification

Intelligent Data Analysis ◽

10.3233/ida-194878 ◽

2021 ◽

Vol 25 (1) ◽

pp. 21-34

Author(s):

Rafael B. Pereira ◽

Alexandre Plastino ◽

Bianca Zadrozny ◽

Luiz H.C. Merschmann

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Selection Method ◽

Video Classification ◽

Classification Problems ◽

Class Label ◽

New Feature ◽

Feature Selection Techniques ◽

Biomolecular Analysis

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific to the multi-label context. Experimental results show that the proposed technique is competitive with multi-label feature selection techniques currently used in the literature, and is clearly more scalable in scenarios where the amount of data is increasing.
