A New Big Data Feature Selection Approach for Text Classification

2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Houda Amazal ◽  
Mohamed Kissi

Feature selection (FS) is a fundamental task for text classification problems. Text feature selection aims to represent documents using the most relevant features. This process can reduce the size of datasets and improve the performance of machine learning algorithms. Many researchers have focused on elaborating efficient FS techniques. However, most of the proposed approaches are evaluated on small datasets and validated using single machines. As textual data dimensionality grows, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed approach for feature selection based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores the frequency of terms during the selection of features. The proposal introduces a distributed FS method, namely Maximum Term Frequency-Mutual Information (MTF-MI), based on term frequency and mutual information techniques to improve the quality of the selected features. The proposed approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naïve Bayes classifier on three datasets. Through a series of tests, the results reveal that the proposed MTF-MI method improves the classification results compared with four state-of-the-art methods in terms of macro-F1 and micro-F1 measures.
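To illustrate the idea behind a frequency-weighted MI score, here is a minimal single-machine sketch: each term's mutual information with the class label is weighted by the term's maximum frequency in any document, so that frequent terms are no longer ignored. The function names, the add-one smoothing, and the toy corpus are all assumptions for illustration; the paper's actual MTF-MI runs distributed on Hadoop via MapReduce.

```python
import math

def mutual_information(docs, labels, term):
    """MI (bits) between the presence of `term` and the class label,
    estimated from document counts with add-one smoothing
    (four contingency cells for the two-class case)."""
    n = len(docs)
    mi = 0.0
    for c in set(labels):
        for present in (True, False):
            joint = sum(1 for d, y in zip(docs, labels)
                        if (term in d) == present and y == c)
            p_xy = (joint + 1) / (n + 4)
            p_x = (sum(1 for d in docs if (term in d) == present) + 2) / (n + 4)
            p_y = (labels.count(c) + 2) / (n + 4)
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def mtf_mi_score(docs, labels, term):
    """MTF-MI-style score: MI weighted by the maximum frequency of the
    term in a single document, so term frequency is not ignored."""
    max_tf = max((d.count(term) for d in docs), default=0)
    return max_tf * mutual_information(docs, labels, term)

# Toy corpus: tokenized documents with their class labels.
docs = [["spam", "offer", "offer"], ["spam", "win"],
        ["meeting", "agenda"], ["meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
```

In a MapReduce setting, the per-term document counts and term frequencies would be accumulated by mappers and combined in reducers; the scoring formula itself is unchanged.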

2017 ◽  
Vol 7 (1) ◽  
Author(s):  
Jiamei Liu ◽  
Cheng Xu ◽  
Weifeng Yang ◽  
Yayun Shu ◽  
Weiwei Zheng ◽  
...  

Abstract Binary classification is widely employed to support decisions on various biomedical big data questions, such as clinical drug trials between treated participants and controls, and genome-wide association studies (GWASs) between participants with or without a phenotype. A machine learning model is trained for this purpose by optimizing its power to discriminate samples from the two groups. However, most classification algorithms tend to generate one locally optimal solution according to the input dataset and the mathematical presumptions about the dataset. Here we demonstrate, from the aspects of both disease classification and feature selection, that multiple different solutions may have similar classification performances. The existing machine learning algorithms may thus have ignored a horde of fishes by catching only a good one. Since most existing machine learning algorithms generate a solution by optimizing a mathematical goal, it may be essential, for understanding the biological mechanisms underlying the investigated classification question, to consider both the generated solution and the ignored ones.
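A toy sketch of the point being made: when two features carry the same label information, two genuinely different single-feature classifiers reach identical performance, and an algorithm that returns only one of them hides the alternative. The threshold-rule setup below is an illustrative assumption, not the paper's actual models.

```python
def rule_accuracy(X, y, feature, threshold=0.5):
    """Accuracy of the one-feature rule: predict class 1 iff
    x[feature] > threshold."""
    return sum(int(x[feature] > threshold) == t
               for x, t in zip(X, y)) / len(y)

# Two redundant features: two *different* single-feature classifiers
# (use feature 0, or use feature 1) achieve identical performance.
X = [(0, 0), (0, 0), (1, 1), (1, 1)]
y = [0, 0, 1, 1]
```

Both rules classify the toy data perfectly, so a learner that reports only one of them has silently discarded an equally good solution.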


As new technologies emerge, data is being generated in larger volumes and higher dimensions. The high dimensionality of data can give rise to great challenges during classification. The presence of redundant features and noisy data degrades the performance of the model. So, it is necessary to extract the relevant features from a given dataset. Feature extraction is an important step in many machine learning algorithms. Many researchers have attempted to extract such features. Among the different feature extraction methods, mutual information is a widely used feature selection method because of its good quality of quantifying dependency among features in classification problems. To cope with this issue, in this paper we propose a simplified mutual-information-based feature selection method with lower computational overhead. The selected feature subset is evaluated with a multilayer perceptron on the KDD CUP 99 dataset for 2-class, 4-class and 5-class classification. The accuracy of these models is almost unchanged with a smaller number of features.
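A minimal sketch of a plain MI filter of this kind might score each feature column by its mutual information with the label and keep the top k indices. This is an illustrative stand-in only; the paper's specific simplification of the MI computation is not reproduced here.

```python
import math
from collections import Counter

def mi_discrete(x, y):
    """Mutual information (bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def select_top_k(rows, labels, k):
    """Rank each feature column by its MI with the label and keep the
    k highest-scoring feature indices (a plain filter approach)."""
    scores = [(mi_discrete([r[j] for r in rows], labels), j)
              for j in range(len(rows[0]))]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]
```

The reduced subset would then be fed to the downstream classifier (a multilayer perceptron in the paper) in place of the full feature set.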


Proceedings ◽  
2018 ◽  
Vol 2 (18) ◽  
pp. 1187
Author(s):  
Laura Morán-Fernández ◽  
Verónica Bolón-Canedo ◽  
Amparo Alonso-Betanzos

Data is growing at an unprecedented pace. With the variety, speed and volume of data flowing through networks and databases, newer approaches based on machine learning are required. But what is really big in Big Data? Should it depend on the numerical representation of the machine? Since portable embedded systems have been growing in importance, there is also increased interest in implementing machine learning algorithms with a limited number of bits. Not only learning but also feature selection, most of the time a mandatory preprocessing step in machine learning, is often constrained by the available computational resources. In this work, we consider mutual information, one of the most common measures of dependence used in feature selection algorithms, computed with reduced-precision parameters.
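As a rough illustration of reduced-precision MI, the sketch below quantizes every estimated probability to a fixed number of fractional bits before evaluating the MI sum, mimicking a limited-precision embedded implementation. The quantization scheme and function names are assumptions for illustration, not the authors' exact scheme.

```python
import math
from collections import Counter

def quantize(p, bits):
    """Round a probability to fixed point with `bits` fractional bits."""
    scale = 1 << bits
    return round(p * scale) / scale

def mi_fixed_point(x, y, bits=8):
    """Mutual information (bits) where every estimated probability is
    first quantized to `bits` fractional bits."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = quantize(c / n, bits)
        p_a = quantize(px[a] / n, bits)
        p_b = quantize(py[b] / n, bits)
        if p_ab > 0 and p_a > 0 and p_b > 0:  # skip cells rounded to zero
            mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Comparing the score across different `bits` settings shows how far the precision can be reduced before the feature ranking induced by MI starts to change.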


2016 ◽  
Vol 6 (1) ◽  
pp. 11-24
Author(s):  
Muhammad A. Sulaiman ◽  
Jane Labadin

Mutual Information (MI) is an information theory concept often used in recent times as a criterion for feature selection methods. This is due to its ability to capture both linear and non-linear dependency relationships between two variables. In theory, mutual information is formulated based on the probability density functions (pdfs) or entropies of the two variables. In most machine learning applications, mutual information estimation is formulated for classification problems (that is, data with labeled output). This study investigates the use of mutual information estimation as a feature selection criterion for regression tasks and introduces an enhancement for selecting an optimal feature subset based on previous works. Specifically, while focusing on regression tasks, it builds on previous work in which a scientifically sound stopping criterion for feature selection greedy algorithms was proposed. Four real-world regression datasets were used in this study: three of the datasets are public, obtained from the UCI machine learning repository, and the remaining one is a private well-log dataset. Two machine learning models, namely multiple regression and artificial neural networks (ANN), were used to test the performance of IFSMIR. The results obtained prove the effectiveness of the proposed method.
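A minimal sketch of an MI-driven greedy forward selector for regression might look as follows, with MI for continuous variables estimated by crude equal-width binning and a fixed MI threshold standing in for the stopping criterion. The binning estimator, the `min_gain` threshold, and all names are assumptions for illustration; IFSMIR's actual estimator and stopping rule are not reproduced here.

```python
import math
from collections import Counter

def binned_mi(x, y, bins=4):
    """MI (bits) between two numeric sequences, estimated by equal-width
    binning (a crude stand-in for kernel or k-NN estimators)."""
    def digitize(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0  # guard against constant sequences
        return [min(int((t - lo) / w), bins - 1) for t in v]
    xb, yb = digitize(x), digitize(y)
    n = len(x)
    px, py, pxy = Counter(xb), Counter(yb), Counter(zip(xb, yb))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def greedy_forward_select(features, target, min_gain=0.05):
    """Greedily add the feature whose MI with the target is highest,
    stopping once the best remaining score drops below `min_gain`."""
    remaining = list(range(len(features)))
    selected = []
    while remaining:
        best = max(remaining, key=lambda j: binned_mi(features[j], target))
        if binned_mi(features[best], target) < min_gain:
            break  # stopping criterion: no remaining feature is informative
        selected.append(best)
        remaining.remove(best)
    return selected
```

The selected columns would then be passed to the regression model (multiple regression or an ANN in the paper) for evaluation.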


2021 ◽  
Vol 9 (1) ◽  
pp. 595-603
Author(s):  
Shivangi Srivastav ◽  
Rajiv Ranjan Tewari

Speech is a significant trait for distinguishing a person in daily human-to-human communication. Like other biometric measures, such as the face, iris and fingerprints, voice can therefore be used as a biometric measure for recognizing or identifying a person. Speaker recognition is a form of voice recognition in which the speaker is identified from the utterance rather than from the message. Automatic Speaker Recognition (ASR) identifies people based on features extracted from speech utterances. Speech signals are rich communication media that constantly convey useful information, such as a speaker's emotion, gender, accent, and other distinctive attributes. In any speaker identification task, the essential steps are to extract useful features and to build representative speaker models. A theoretical description, a taxonomy of emotional states and the modalities of emotion expression are also covered. A speech emotion recognition (SER) framework is developed to conduct this investigation, built on different classifiers and different feature extraction techniques. In this work, various machine learning algorithms are investigated to identify the decision boundary in the feature space of audio signals. Moreover, the novelty of this work lies in improving the performance of classical machine learning algorithms using information-theory-based feature selection methods. The highest accuracy achieved is 96 percent, using the Random Forest algorithm combined with the Joint Mutual Information (JMI) feature selection method.
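The JMI criterion mentioned above can be sketched on discretized features as follows: the first feature maximizes I(f; y), and each subsequent feature maximizes the sum, over already selected features s, of the joint MI I((f, s); y). The toy discrete data is an assumption for illustration; in the paper the features would be extracted from audio signals.

```python
import math
from collections import Counter

def mi(x, y):
    """Mutual information (bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def jmi_select(features, labels, k):
    """Joint Mutual Information selection: the first feature maximizes
    I(f; y); each later feature maximizes the sum over already selected
    features s of I((f, s); y), rewarding complementary features."""
    remaining = list(range(len(features)))
    first = max(remaining, key=lambda j: mi(features[j], labels))
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda j: sum(mi(list(zip(features[j], features[s])),
                                        labels)
                                     for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The selected indices would then feed a downstream classifier such as the Random Forest used in the study.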


We are in the information age and are thereby collecting very huge volumes of data from diverse sources in structured, unstructured and semi-structured form, ranging from petabytes to exabytes. Data is an asset, as valuable knowledge and information is hidden in such massive volumes of data. Data analytics is required to gain deeper insights and identify fine-grained patterns so as to make accurate predictions, enabling improved decision making. Extracting knowledge from data is done by data analytics, and machine learning forms its core. The increase in the dimensionality of data, both in terms of the number of tuples and the number of features, poses several challenges to machine learning algorithms. Preprocessing of data is done as a step prior to machine learning, so feature selection is performed as a preprocessing step to reduce the dimensionality of the data, thereby removing irrelevant features and improving the efficiency and accuracy of a machine learning algorithm. In this paper we study various feature selection mechanisms and analyze whether they can be adopted for sentiment analysis of big data.


Author(s):  
Anantvir Singh Romana

Accurate diagnostic detection of a disease in a patient is critical and may alter the subsequent treatment and increase the chances of survival. Machine learning techniques have been instrumental in disease detection and are currently used in various classification problems due to their accurate prediction performance. Different techniques may provide different accuracies, and it is therefore imperative to use the most suitable method that provides the best results. This research seeks to provide a comparative analysis of Support Vector Machine, Naïve Bayes, J48 Decision Tree and neural network classifiers on breast cancer and diabetes datasets.
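The shape of such a comparison can be sketched in miniature: train two different classifiers on the same data and compare their accuracies on a held-out set. A pure-Python Gaussian naive Bayes and a 1-nearest-neighbour baseline stand in here for the SVM, J48 and neural network of the study, and the two-cluster toy data stands in for the breast cancer and diabetes datasets; all of this is an illustrative assumption.

```python
import math

def gaussian_nb_fit(X, y):
    """Per-class prior, feature means and variances for Gaussian naive Bayes."""
    model = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        vars_ = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                 for col, m in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(X), means, vars_)
    return model

def gaussian_nb_predict(model, x):
    """Return the class with the highest log-posterior under the model."""
    def log_post(c):
        prior, means, vars_ = model[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, vars_))
    return max(model, key=log_post)

def one_nn_predict(X, y, x):
    """1-nearest-neighbour baseline (squared Euclidean distance)."""
    i = min(range(len(X)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], x)))
    return y[i]

def accuracy(predict, X_test, y_test):
    """Fraction of test points whose prediction matches the label."""
    return sum(predict(x) == t for x, t in zip(X_test, y_test)) / len(y_test)

# Two well-separated toy clusters standing in for the real datasets.
X_train = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [(0.5, 0.5), (5.5, 5.5)]
y_test = [0, 1]
```

In practice the comparison would use cross-validation on the real datasets rather than a single toy split, but the harness is the same: fit each model, score each on identical held-out data, and compare.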
