English Text Classification Using Improved Recursive Feature Elimination (IRFE) Algorithm: تصنيف النص الإنجليزي باستخدام الخوارزمية العودية المحسنة لإزالة الخواص (IRFE)

Journal of engineering sciences and information technology - مجلة العلوم الهندسية و تكنولوجيا المعلومات ◽

10.26389/ajsrp.r080420 ◽

2020 ◽

Vol 4 (2) ◽

Author(s):

Esraa H. Abd Al-Ameer, Ahmed H. Aliwy

Keyword(s):

Feature Selection ◽

Language Processing ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method ◽

English Text ◽

Recursive Feature Elimination ◽

Chi Square ◽

Data Set ◽

New Feature

Document classification is one of the most important fields in natural language processing and text mining, and many algorithms can be used for this task. This paper focuses on improving text classification through feature selection, that is, retaining a subset of the original features without affecting accuracy. We propose a new feature selection method that can be viewed as a general formulation and mathematical model of Recursive Feature Elimination (RFE). The proposed method was compared with two other well-known feature selection methods: Chi-square and threshold. The results showed that the new method is comparable with the other methods. The best results were 83% when 60% of the features were used, 82% when 40% were used, and 82% when 20% were used. The tests were done with the Naïve Bayes (NB) and decision tree (DT) classification algorithms on the well-known English data set “20 newsgroups text”, which consists of approximately 18846 files. Overall, the suggested feature selection method is comparable with standard methods such as Chi-square.
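The Chi-square baseline named in this abstract can be illustrated with a minimal sketch. It is not the authors' IRFE method or their experimental code. It assumes pre-tokenized documents, scores each term against a single target class with the standard 2x2 chi-square statistic, and keeps a fixed fraction of the terms, mirroring the 60%/40%/20% settings. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term with the 2x2 chi-square statistic against one class.

    docs   -- list of token lists (hypothetical pre-tokenized documents)
    labels -- one class label per document
    target -- the class the terms are scored against
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        counter = df_pos if y == target else df_neg
        counter.update(set(tokens))  # document frequency, not term frequency

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # target-class documents containing the term
        b = df_neg[term]   # other documents containing the term
        c = n_pos - a      # target-class documents without the term
        d = n_neg - b      # other documents without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A term present in every document carries no information: score 0.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def keep_top_fraction(scores, fraction):
    """Return the highest-scoring terms, keeping roughly `fraction` of them."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties: alphabetical
    return ranked[:k]


# Toy corpus (made up for illustration only).
docs = [
    ["goal", "team", "the"],
    ["team", "match", "the"],
    ["cpu", "disk", "the"],
    ["cpu", "memory", "the"],
]
labels = ["sport", "sport", "tech", "tech"]
scores = chi_square_scores(docs, labels, target="sport")
selected = keep_top_fraction(scores, 0.4)
print(selected)
```

In a multi-class setting such as 20 newsgroups, the per-class scores would typically be combined (for example by taking the maximum or a weighted average over classes); that step is omitted here.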

A NEW FEATURE SELECTION METHOD FOR TEXT CLASSIFICATION

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001407005466 ◽

2007 ◽

Vol 21 (02) ◽

pp. 423-438 ◽

Cited By ~ 9

Author(s):

GULDEN UCHYIGIT ◽

KEITH CLARK

Keyword(s):

Feature Selection ◽

Text Classification ◽

Information Gain ◽

Feature Selection Method ◽

Feature Space ◽

Selection Method ◽

Computational Time ◽

Small Subset ◽

Selection Methods ◽

New Feature

Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major difficulty in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that can be used to determine a document's class, while the rest add noise, can make the results unreliable, and significantly increase computational time. A common approach to this problem is feature selection, where the number of words in the feature space is significantly reduced. In this paper we present a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated, including a new feature selection method called the GU metric. The other feature selection methods evaluated are: Chi-Squared (χ2) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.
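The abstract reports F1 and F2 scores without defining them. Assuming they are the usual F-beta measures with beta = 1 and beta = 2 (an assumption, not something the abstract states), the following sketch shows how the two differ. The precision and recall values are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta measure)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical classifier with perfect recall but only 50% precision.
precision, recall = 0.5, 1.0
f1 = f_beta(precision, recall, beta=1.0)  # weighs precision and recall equally
f2 = f_beta(precision, recall, beta=2.0)  # weighs recall more than precision
print(round(f1, 4), round(f2, 4))
```

Because F2 emphasizes recall, a feature selection method that keeps recall high is rewarded more under F2 than under F1.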

A Chi-Square Statistics Based Feature Selection Method in Text Classification

2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS) ◽

10.1109/icsess.2018.8663882 ◽

2018 ◽

Cited By ~ 3

Author(s):

Yujia Zhai ◽

Wei Song ◽

Xianjun Liu ◽

Lizhen Liu ◽

Xinlei Zhao

Keyword(s):

Feature Selection ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method ◽

Chi Square

A new feature selection method for handling redundant information in text classification

Frontiers of Information Technology & Electronic Engineering ◽

10.1631/fitee.1601761 ◽

2018 ◽

Vol 19 (2) ◽

pp. 221-234 ◽

Cited By ~ 6

Author(s):

You-wei Wang ◽

Li-zhou Feng

Keyword(s):

Feature Selection ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method ◽

Redundant Information ◽

New Feature

Recursive Feature Elimination with Ridge Regression (L2) Machine Learning Hybrid Feature Selection Algorithm for Diabetic Prediction using Random Forest Classifier.

10.21203/rs.3.rs-742641/v1 ◽

2021 ◽

Author(s):

K Venkatachalam ◽

P Prabhu ◽

B Saravana Balaji ◽

Mohamed Abouhawwash ◽

R Rajadevi

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Ridge Regression ◽

Feature Selection Method ◽

Selection Method ◽

Recursive Feature Elimination ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Data Set

In day-to-day life, cases of diabetes are increasing because the body is unable to metabolize glucose. Correctly identifying diabetes patients is an important research area, and many researchers have proposed data mining and machine learning techniques to predict the disease. In prediction, feature selection is a key preprocessing concept: only the features relevant to the disease are used, which improves prediction accuracy. Selecting the right features from the whole feature set is a complicated process, and many researchers are concentrating on it to produce predictive models with high accuracy. In this work, the wrapper-based feature selection method called Recursive Feature Elimination (RFE) is combined with Ridge regression (L2) to form a hybrid L2-regularized feature selection algorithm that addresses the overfitting problem of the data set. Overfitting is a major problem in feature selection: the model does not fit new data because the training data is small. Ridge regression is mainly used to overcome overfitting. Once the features are selected using the proposed method, a random forest classifier classifies the data based on the selected features. The proposed work is evaluated on the PIDD data set, and the results are compared with existing algorithms to assess the accuracy of the proposed algorithm. According to the results, the proposed algorithm predicts diabetes with higher accuracy than the other existing algorithms.
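One possible reading of "RFE combined with ridge regression" is sketched below. It is not the authors' algorithm: it assumes a plain linear model fitted by gradient descent on a ridge-penalized squared loss, without an intercept, and it repeatedly drops the feature with the smallest absolute weight. The function names, hyperparameters, and synthetic data are all hypothetical, and the random forest classification stage described in the abstract is not shown.

```python
from itertools import product


def fit_ridge(X, y, alpha=0.1, lr=0.1, epochs=500):
    """Minimize (1/2n) * sum(err^2) + (alpha/2) * ||w||^2 by gradient descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(epochs):
        grad = [0.0] * p
        for row, target in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, row)) - target
            for j in range(p):
                grad[j] += err * row[j]
        # Data gradient averaged over samples, plus the L2 penalty gradient.
        w = [wj - lr * (g / n + alpha * wj) for wj, g in zip(w, grad)]
    return w


def recursive_feature_elimination(X, y, n_keep, alpha=0.1):
    """Refit the ridge model and drop the weakest feature until n_keep remain."""
    active = list(range(len(X[0])))  # original column indices still in play
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w = fit_ridge(X_sub, y, alpha=alpha)
        weakest = min(range(len(active)), key=lambda i: abs(w[i]))
        del active[weakest]
    return active


# Synthetic data: 4 features on a full +/-1 grid; only features 0 and 2 matter.
X = [list(row) for row in product([-1.0, 1.0], repeat=4)]
y = [2.0 * row[0] - 3.0 * row[2] for row in X]

selected = recursive_feature_elimination(X, y, n_keep=2)
print(selected)
```

Refitting after every removal is what distinguishes recursive elimination from a single-pass ranking: the weights of the surviving features are re-estimated once a correlated or noisy feature is gone.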

A new feature selection method based on distributional information for Text Classification

2010 IEEE International Conference on Progress in Informatics and Computing ◽

10.1109/pic.2010.5687404 ◽

2010 ◽

Author(s):

Nianyun Shi ◽

Lingling Liu

Keyword(s):

Feature Selection ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method ◽

Distributional Information ◽

New Feature

A lazy feature selection method for multi-label classification

Intelligent Data Analysis ◽

10.3233/ida-194878 ◽

2021 ◽

Vol 25 (1) ◽

pp. 21-34

Author(s):

Rafael B. Pereira ◽

Alexandre Plastino ◽

Bianca Zadrozny ◽

Luiz H.C. Merschmann

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Selection Method ◽

Video Classification ◽

Classification Problems ◽

Class Label ◽

New Feature ◽

Feature Selection Techniques ◽

Biomolecular Analysis

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific for the multi-label context. Experimental results show that the proposed technique is competitive when compared to multi-label feature selection techniques currently used in the literature, and is clearly more scalable, in a scenario where there is an increasing amount of data.

Radiogenomic modeling predicts survival-associated prognostic groups in glioblastoma

Neuro-Oncology Advances ◽

10.1093/noajnl/vdab004 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Nicholas Nuechterlein ◽

Beibin Li ◽

Abdullah Feroze ◽

Eric C Holland ◽

Linda Shapiro ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Molecular Subtypes ◽

Feature Selection Method ◽

Area Under The Curve ◽

Selection Method ◽

Recursive Feature Elimination ◽

Signal Abnormality ◽

Mri Features ◽

Mri Scans

Background: Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods: Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results: We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
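Of the baselines named in this abstract, variance thresholding is the simplest to illustrate. The sketch below is a generic version, not the study's pipeline. It assumes features are stored as rows of floats and drops any column whose population variance does not exceed a cutoff. The threshold value and the toy matrix are made up.

```python
from statistics import pvariance


def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


# Toy feature matrix: column 0 is constant, column 2 barely varies.
rows = [
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 3.1],
    [1.0, 0.0, 2.9],
    [1.0, 1.0, 3.0],
]
kept = variance_threshold(rows, threshold=0.01)
print(kept)
```

Variance thresholding ignores the class labels entirely, which is one reason it serves as a weak baseline when features greatly outnumber patients.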

Design of novel multi filter union feature selection framework for breast cancer dataset

Concurrent Engineering ◽

10.1177/1063293x211016046 ◽

2021 ◽

pp. 1063293X2110160

Author(s):

Dinesh Morkonda Gunasekaran ◽

Prabha Dhandayudam

Keyword(s):

Breast Cancer ◽

Feature Selection ◽

Care Center ◽

Feature Selection Method ◽

Selection Method ◽

Cancer Center ◽

Breast Cancer Dataset ◽

Data Set ◽

Health Care Center ◽

Cancer Data

Nowadays women are commonly diagnosed with breast cancer. Feature selection is an important step when constructing a classification framework. We propose a Multi filter union (MFU) feature selection method for breast cancer data sets. A union model based on the random forest algorithm and the Logistic regression (LG) algorithm is used to select the important features in the dataset. Performance is evaluated using the optimal feature subset from the selected dataset. The experiments were run on the data set of the Wisconsin diagnostic breast cancer center and then on a real data set from a women's health care center. The results show that the proposed approach achieves high performance and is efficient compared with existing feature selection algorithms.

A New Feature Selection Method for Enhancing Cancer Diagnosis Based on DNA Microarray

2020 37th National Radio Science Conference (NRSC) ◽

10.1109/nrsc49500.2020.9235095 ◽

2020 ◽

Author(s):

Mostafa Atlam ◽

Hanaa Torkey ◽

Hanaa Salem ◽

Nawal El-Fishawy

Keyword(s):

Feature Selection ◽

Dna Microarray ◽

Cancer Diagnosis ◽

Feature Selection Method ◽

Selection Method ◽

New Feature

A Robust Gene selection Method for Microarray-based Cancer Classification

Cancer Informatics ◽

10.4137/cin.s3794 ◽

2010 ◽

Vol 9 ◽

pp. CIN.S3794 ◽

Cited By ~ 21

Author(s):

Xiaosheng Wang ◽

Osamu Gotoh

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Selection ◽

Information Gain ◽

Expression Profiles ◽

Feature Selection Method ◽

Gene Expression Profiles ◽

Molecular Classification ◽

Selection Method ◽

Chi Square

Gene selection is of vital importance in molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is crucial. We investigated the properties of one feature selection approach proposed in our previous work, which was a generalization of the feature selection method based on the depended degree of attribute in rough sets. We compared the feature selection method with the established methods (the depended degree, chi-square, information gain, Relief-F and symmetric uncertainty) and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended degree of attribute based method in robustness and applicability. Moreover, the method was comparable to the other four commonly used methods. More importantly, the method can exhibit the inherent classification difficulty with respect to different gene expression datasets, indicating the inherent biology of specific cancers.
