Pearson’s Redundancy Multi-Filtering with BAT Algorithm for Selecting High Dimensional Imbalanced Features

Author(s):  
Ala Saleh Alluhaidan ◽  
Prabu P ◽  
Sivakumar R

Abstract: Feature selection plays a vital role in every data analysis application. It aims to choose a prominent set of features by removing redundant and irrelevant features from the original feature set. High-dimensional datasets are challenging for machine learning algorithms, and many state-of-the-art solutions have been developed to handle this issue. High dimensionality combined with an imbalance ratio in the dataset makes the task harder still. To overcome this issue, this paper introduces a novel method, Pearson’s Redundancy Based Multi Filter algorithm with improved BAT algorithm (PRBMF-iBAT), to obtain multiple feature subsets. PRBMF is implemented using multiple filters to obtain highly relevant features. The iBAT algorithm uses these features to find the best subset of features for classification. The results show that PRBMF-iBAT improves classifier performance in terms of accuracy, precision, recall and F-measure on three microarray datasets with an SVM classifier. The proposed system achieves a highest accuracy of 97.99%, compared with the existing rCBR-BGOA algorithm.
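To make the idea of a Pearson-based redundancy filter concrete, the following is a minimal sketch and not the PRBMF procedure itself, whose details the abstract does not give. It assumes a greedy rule: rank features by absolute correlation with the label, then drop any feature whose absolute correlation with an already kept feature reaches a threshold. The function names, the 0.9 threshold and the toy data are all illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def redundancy_filter(columns, labels, threshold=0.9):
    """Return the sorted indices of the features that are kept.

    Features are visited in order of decreasing |correlation with the label|.
    A feature is dropped when its |correlation| with any feature already kept
    is at or above `threshold` (0.9 is an assumed, illustrative value).
    """
    relevance = [abs(pearson(col, labels)) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: relevance[i], reverse=True)
    kept = []
    for i in order:
        if all(abs(pearson(columns[i], columns[j])) < threshold for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy data (hypothetical): feature 1 is an exact multiple of feature 0,
# feature 2 is only weakly correlated with either of them.
features = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]
selected = redundancy_filter(features, labels)
print(selected)
```

On this toy input only one of the two perfectly correlated features survives, while the weakly correlated third feature is kept. A multi-filter scheme would combine several such relevance measures; this sketch uses one for brevity.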

2020 ◽  
pp. 3397-3407
Author(s):  
Nur Syafiqah Mohd Nafis ◽  
Suryanti Awang

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant features from the sparse feature space. Thus, this paper proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high-dimensional text classification. This technique can measure a feature’s importance in a high-dimensional text document. In addition, it aims to increase the efficiency of feature selection and hence obtain promising text classification accuracy. TF-IDF acts as a filter approach that measures the importance of features in the text documents at the first stage. SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research runs sets of experiments using text documents retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is carried out on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is then applied for feature ranking as the next feature selection step. Only top-ranked features are selected for text classification using the SVM classifier. The experiments show that the proposed technique achieves 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high-dimensional text documents.
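The first-stage filter described above can be sketched as follows. This is an illustration only: it assumes whitespace tokenization, tf = count / document length, idf = log(N / df), and a summed score per term, none of which is specified in the abstract. The function names and the three sample sentences are hypothetical, and the second stage (SVM-RFE) is not shown.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return {term: TF-IDF score summed over all documents}.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    Other TF-IDF variants (smoothing, normalization) exist.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = Counter()
    for tokens in tokenized:
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf
    return dict(scores)


def top_k_terms(documents, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    scores = tfidf_scores(documents)
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


# Hypothetical mini-corpus standing in for pre-processed posts.
docs = [
    "the flight was delayed again",
    "the service was great",
    "the flight crew was great",
]
scores = tfidf_scores(docs)
filtered_vocabulary = top_k_terms(docs, k=3)
print(filtered_vocabulary)
```

Terms that occur in every document, such as "the" and "was" here, receive an idf of zero and fall to the bottom of the ranking, which is the filtering effect the first stage relies on. In a two-stage pipeline, the retained vocabulary would then be passed to a wrapper or embedded method for further ranking.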


2021 ◽  
Vol 36 (1) ◽  
pp. 721-726
Author(s):  
S. Mahesh ◽  
Dr.G. Ramkumar

Aim: Machine learning algorithms play a vital role in various biometric applications due to their strong results in detection, recognition and classification. The main objective of this work is to perform a comparative analysis of two different machine learning algorithms for recognizing a person from low-resolution images with high accuracy. Materials & Methods: AlexNet Convolutional Neural Network (ACNN) and Support Vector Machine (SVM) classifiers are implemented to recognize faces in a low-resolution image dataset with 20 samples each. Results: Simulation results show that ACNN achieves a significantly higher recognition rate, with 98% accuracy compared with 89% for SVM. The difference in accuracy was also significant (p=0.002) in SPSS statistical analysis. Conclusion: For the low-resolution images considered, the ACNN classifier provides better accuracy than the SVM classifier.


2019 ◽  
Vol 8 (4) ◽  
pp. 6140-6144

In this work, we propose a novel method for illumination-invariant facial expression recognition. Facial expressions are used to convey nonverbal visual information among humans. They also play a vital role in human-machine interface modules, which have attracted the attention of many researchers. Earlier machine learning algorithms require complex feature extraction algorithms and rely on the size and uniqueness of features related to the subjects. In this paper, a deep convolutional neural network is proposed for facial expression recognition, and it is trained on two publicly available datasets, the JAFFE and Yale databases, under different illumination conditions. Furthermore, transfer learning is used with pre-trained networks such as AlexNet and ResNet-101 trained on the ImageNet database. Experimental results show that the designed network can recognize expressions under up to 30% variation in illumination and achieves an accuracy of 92%.


2021 ◽  
pp. 1-10
Author(s):  
Diwakar Tripathi ◽  
B. Ramachandra Reddy ◽  
Y.C.A. Padmanabha Reddy ◽  
Alok Kumar Shukla ◽  
Ravi Kant Kumar ◽  
...  

Credit scoring plays a vital role for financial institutions in estimating the risk associated with an applicant applying for a credit product. It is estimated based on applicants’ credentials and directly affects the viability of issuing institutions. However, there may be a large number of irrelevant features in the credit scoring dataset. Due to irrelevant features, credit scoring models may show poorer classification performance and higher complexity. Removing redundant and irrelevant features may therefore overcome the problem of a large number of features. In this work, we emphasize the role of feature selection in enhancing the predictive performance of the credit scoring model. For feature selection, a Binary BAT optimization technique is used with a novel fitness function. Further, the proposed approach is combined with “Radial Basis Function Neural Network (RBFN)”, “Support Vector Machine (SVM)” and “Random Forest (RF)” for classification. The proposed approach is validated on four benchmark credit scoring datasets obtained from the UCI repository. Finally, a comprehensive analysis of the experimental results compares classification performance with features selected by various approaches and with other state-of-the-art approaches for credit scoring.
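As a rough illustration of how a binary bat search can explore feature masks, here is a simplified sketch. It is not the approach from the abstract: the paper's novel fitness function is not described, so `toy_fitness` below is a hypothetical stand-in, and the sketch assumes a sigmoid transfer function, clamped velocities, and constant loudness and pulse rate (the full algorithm typically adapts both over time).

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def binary_bat(fitness, n_features, n_bats=10, n_iter=30, seed=0,
               f_min=0.0, f_max=2.0, loudness=0.9, pulse_rate=0.5, v_max=6.0):
    """Maximize `fitness` over 0/1 masks; returns (best_mask, best_score)."""
    rng = random.Random(seed)
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_bats)]
    velocities = [[0.0] * n_features for _ in range(n_bats)]
    scores = [fitness(p) for p in positions]
    best_idx = max(range(n_bats), key=lambda i: scores[i])
    best, best_score = positions[best_idx][:], scores[best_idx]

    for _ in range(n_iter):
        for i in range(n_bats):
            # Each bat draws a random frequency that scales its velocity update.
            freq = f_min + (f_max - f_min) * rng.random()
            candidate = []
            for d in range(n_features):
                v = velocities[i][d] + (positions[i][d] - best[d]) * freq
                v = max(-v_max, min(v_max, v))  # clamp to keep exp() safe
                velocities[i][d] = v
                # Sigmoid transfer: velocity becomes the probability of a 1 bit.
                candidate.append(1 if rng.random() < sigmoid(v) else 0)
            if rng.random() > pulse_rate:
                # Local search: copy the best mask and flip one random bit.
                candidate = best[:]
                flip = rng.randrange(n_features)
                candidate[flip] = 1 - candidate[flip]
            cand_score = fitness(candidate)
            # The bat moves only if the candidate is no worse and the
            # loudness test passes.
            if cand_score >= scores[i] and rng.random() < loudness:
                positions[i], scores[i] = candidate, cand_score
            # The global best is updated on any strict improvement.
            if cand_score > best_score:
                best, best_score = candidate[:], cand_score
    return best, best_score


INFORMATIVE = {0, 3}  # hypothetical "useful" feature indices for the toy problem


def toy_fitness(mask):
    """Reward informative features, penalize subset size (assumed weights)."""
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


best_mask, best_score = binary_bat(toy_fitness, n_features=8)
print(best_mask, best_score)
```

In a real feature-selection setting, the fitness function would typically train and score a classifier on the features selected by the mask, which makes each evaluation far more expensive than in this toy problem.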


Author(s):  
Yasaswini V. ◽  
Santhi Baskaran

Data mining is the process of searching large existing databases in order to obtain new and useful information. It now plays a vital role in all sorts of fields such as medicine, engineering, banking, education and fraud detection. In this paper, feature selection, which is a part of data mining, is performed for classification. The role of feature selection is discussed in the context of deep learning, along with how it relates to feature engineering. Feature selection is a preprocessing technique that selects the appropriate features from the dataset to obtain accurate classification results. Nature-inspired optimization algorithms such as Ant Colony, Firefly, Cuckoo Search and Harmony Search showed better performance, giving the best accuracy rate with fewer selected features and a good F-measure value. These algorithms are used to perform classification that accurately predicts the target class for each case in the dataset. We propose a technique for optimized feature selection for classification using metaheuristic algorithms. We applied a recent optimization algorithm, the Bat algorithm, to UCI datasets; it showed results comparable to the best-performing existing algorithm, Firefly, but with fewer selected features. The work is implemented in Java, and medical datasets from UCI have been used. These datasets were chosen because they have nominal class features. The number of attributes, instances and classes varies across the chosen datasets to represent different combinations. Classification is done using the J48 classifier in the WEKA tool. We thoroughly compare the results of the algorithms used here with those of existing algorithms.


2021 ◽  
Vol 9 (2) ◽  
pp. 113-123
Author(s):  
T. Mathi Murugan ◽  
E. Baburaj ◽  

The classification of high-dimensional datasets is challenging because they contain a large number of irrelevant and noisy features. Thus, feature selection is performed on the dataset to eliminate these redundant features. It reduces the dimensionality of the dataset and increases the classification accuracy. Hence, an improved cuckoo search algorithm (ICSA) is proposed in this paper for selecting the relevant features in high-dimensional data. After feature selection, the dataset undergoes classification using KNN and SVM classifiers. The experiments show that the improved cuckoo search algorithm effectively increases the classification accuracy by reducing the number of features in the dataset. To analyse the proposed algorithm, seven datasets from the UCI repository have been used. The ICS algorithm is also compared with other existing algorithms on the given datasets. The investigation concludes that the proposed algorithm selects fewer features and achieves higher classification accuracy than the other existing algorithms.


Author(s):  
S. Appavu alias Balamurugan ◽  
S. Gilbert Nancy

Feature selection is the process of identifying and removing irrelevant and redundant features. Irrelevant features, along with redundant features, severely affect the accuracy of learning machines. In high-dimensional space, finding clusters of data objects is challenging due to the curse of dimensionality. When the dimensionality increases, data in the irrelevant dimensions may produce much noise. Time complexity is also a major issue in existing approaches. To rectify these issues, our proposed method uses efficient feature subset selection in high-dimensional data. The input dataset considered here is a high-dimensional microarray dataset. First, to select the optimal features, our proposed technique employs a Modified Social Spider Optimization (MSSO) algorithm, in which the traditional Social Spider Optimization is modified with the help of the fruit fly optimization algorithm. Next, the selected features are the input to the classifier. Classification is performed using an Optimized Radial Basis Function based neural network (ORBFNN) technique to classify the microarray data as normal or abnormal. The RBFNN is optimized by means of the artificial bee colony (ABC) algorithm. Experimental results indicate that the proposed classification framework outperforms the existing technique, with accuracies of 93.66%, 97.09%, 98.66%, 98.28% and 98.93% on five benchmark datasets. The proposed method is implemented on the MATLAB platform.


Author(s):  
Vaishali Arya ◽  
Rashmi Agrawal

Aims: Feature Selection Techniques for Text Data Composed of Heterogeneous sources for sentiment classification. Objectives: The objective of this work is to analyze feature selection techniques for text gathered from different sources to increase the accuracy of sentiment classification on microblogs. Methods: Three feature selection techniques, Bag-of-Words (BOW), TF-IDF, and word2vector, were applied to find the most suitable feature selection technique for heterogeneous datasets. Results: TF-IDF performs best out of the three selected feature selection techniques for sentiment classification with an SVM classifier. Conclusion: Feature selection is an integral part of any data preprocessing task, and it is also important for machine learning algorithms in achieving good classification accuracy. Hence it is essential to find the most suitable approach for heterogeneous sources of data. Heterogeneous sources are rich sources of information, and they also play an important role in developing models for adaptable systems. With that in mind, we compared the three techniques on heterogeneous source data and found that TF-IDF is the most suitable one for all types of data, whether balanced or imbalanced, single-source or multiple-source. In all cases, TF-IDF is the most promising approach for classifying users’ sentiments.

