Hybrid Feature Selection for Amharic News Document Classification

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Demeke Endalie ◽  
Getamesay Haile

The number of Amharic digital documents has grown rapidly. Because of this, automatic text classification is extremely important. Proper selection of features has a crucial role in the accuracy of classification and computational time. When the initial feature set is considerably large, it is important to pick the right features. In this paper, we present a hybrid feature selection method, called IGCHIDF, which combines the information gain (IG), chi-square (CHI), and document frequency (DF) feature selection methods. We evaluate the proposed feature selection method on two datasets: dataset 1 containing 9 news categories and dataset 2 containing 13 news categories. Our experimental results showed that the proposed method performs better than other methods on both datasets 1 and 2. The IGCHIDF method's classification accuracy is up to 3.96% higher than the IG method, up to 11.16% higher than CHI, and 7.3% higher than DF on dataset 2.
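The abstract names the three component scores (IG, CHI, DF) but does not say how IGCHIDF combines them. The following is a minimal sketch, assuming the three scores are merged by summing each term's rank under every score; this rank-sum rule, the max-over-classes chi-square, and the function names `term_scores` and `hybrid_select` are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def term_scores(docs, labels):
    """Return {term: (ig, chi, df)} for tokenized docs and their class labels."""
    n = len(docs)
    classes = sorted(set(labels))
    class_count = Counter(labels)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    h_class = entropy(class_count.values(), n)

    scores = {}
    for term in vocab:
        # Per-class number of documents that contain the term.
        with_t = Counter(l for s, l in zip(doc_sets, labels) if term in s)
        df = sum(with_t.values())
        without_t = [class_count[c] - with_t[c] for c in classes]

        # Information gain: class entropy minus entropy given term presence.
        h_cond = 0.0
        if df:
            h_cond += df / n * entropy(with_t.values(), df)
        if n - df:
            h_cond += (n - df) / n * entropy(without_t, n - df)
        ig = h_class - h_cond

        # Chi-square: largest 2x2 term/class statistic over all classes.
        chi = 0.0
        for c in classes:
            a = with_t[c]              # term present, class c
            b = df - a                 # term present, other classes
            e = class_count[c] - a     # term absent, class c
            d = n - df - e             # term absent, other classes
            denom = (a + e) * (b + d) * (a + b) * (e + d)
            if denom:
                chi = max(chi, n * (a * d - b * e) ** 2 / denom)

        scores[term] = (ig, chi, df)
    return scores


def hybrid_select(docs, labels, k):
    """Pick k terms with the best (lowest) summed rank over IG, CHI, and DF.

    Assumption: rank-sum aggregation; ties are broken alphabetically.
    """
    scores = term_scores(docs, labels)
    terms = sorted(scores)
    rank_sum = dict.fromkeys(terms, 0)
    for i in range(3):
        ordered = sorted(terms, key=lambda t: -scores[t][i])
        for rank, term in enumerate(ordered):
            rank_sum[term] += rank
    return sorted(terms, key=lambda t: (rank_sum[t], t))[:k]
```

Calling `hybrid_select(docs, labels, k)` with a list of token lists and a parallel list of labels returns the `k` retained terms. Rank aggregation is used here only because it avoids having to put three differently scaled scores on a common scale.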

2010 ◽  
Vol 9 ◽  
pp. CIN.S3794 ◽  
Author(s):  
Xiaosheng Wang ◽  
Osamu Gotoh

Gene selection is of vital importance in molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is crucial. We investigated the properties of one feature selection approach proposed in our previous work, a generalization of the feature selection method based on the depended degree of attribute in rough sets. We compared the feature selection method with the established methods (the depended degree, chi-square, information gain, Relief-F, and symmetric uncertainty) and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended degree of attribute based method in robustness and applicability. Moreover, the method was comparable to the other four commonly used methods. More importantly, the method can exhibit the inherent classification difficulty with respect to different gene expression datasets, indicating the inherent biology of specific cancers.


Author(s):  
GULDEN UCHYIGIT ◽  
KEITH CLARK

Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major problem in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that can be used to determine a document's class, while the rest add noise, can make the results unreliable, and significantly increase computational time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated in this study, including a new feature selection method called the GU metric. The other feature selection methods evaluated in this study are: Chi-Squared (χ2) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data set with the Naive Bayesian Probabilistic Classifier.


Author(s):  
Atchara Choompol ◽  
Panida Songram ◽  
Phattahanaphong Chomphuwiset

Most opinion comments on social networks are short and ambiguous. Opinion classification of such comments is generally difficult because they lack dominant features. A feature extraction technique is therefore necessary for improving classification accuracy and computational time. This paper proposes an effective feature selection method for opinion classification on a social network. The proposed method selects features based on the concept of a filter model, together with association rules. Support and confidence are used to calculate the weights of features. The features with high weight are selected for classification. Unlike supports in association rules, supports in our method are normalized to 0-1 to remove outlier supports. Moreover, a tuning parameter is used to emphasize the degree of support or confidence. The experimental results show that the proposed method provides high classification efficiency. The proposed method outperforms Information Gain, Chi-Square, and Gini Index in both computational time and accuracy.
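The abstract states that feature weights come from support and confidence, that supports are normalized to 0-1, and that a tuning parameter shifts emphasis between the two, but it gives no formula. The sketch below is one possible reading, assuming min-max normalization of support and a linear blend controlled by a hypothetical parameter `alpha`; the function name `feature_weights` and the blend itself are assumptions, not the authors' definition.

```python
from collections import Counter


def feature_weights(docs, labels, target, alpha=0.5):
    """Weight each feature for class `target` from support and confidence.

    support(f)    = share of all documents that contain f and are labeled target
    confidence(f) = share of documents containing f that are labeled target
    Assumption: weight = alpha * normalized support + (1 - alpha) * confidence.
    Expects at least one non-empty document.
    """
    n = len(docs)
    df = Counter()     # documents containing the feature
    joint = Counter()  # documents containing the feature and labeled target
    for tokens, label in zip(docs, labels):
        for f in set(tokens):
            df[f] += 1
            if label == target:
                joint[f] += 1

    support = {f: joint[f] / n for f in df}
    confidence = {f: joint[f] / df[f] for f in df}

    # Min-max normalize support to the range 0-1 (guard against a zero span).
    lo, hi = min(support.values()), max(support.values())
    span = (hi - lo) or 1.0

    return {
        f: alpha * (support[f] - lo) / span + (1 - alpha) * confidence[f]
        for f in df
    }
```

Under this reading, `alpha` close to 1 favors features that are frequent within the target class, while `alpha` close to 0 favors features that occur almost exclusively in it; the highest-weighted features would then be kept for classification.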


2016 ◽  
Vol 78 (8-2) ◽  
Author(s):  
Jafreezal Jaafar ◽  
Zul Indra ◽  
Nurshuhaini Zamin

Text classification (TC) provides a better way to organize information since it allows better understanding and interpretation of the content. It deals with the assignment of labels to groups of similar textual documents. However, TC research for Asian language documents is relatively limited compared to English documents, and even more so for news articles. Apart from that, TC research to classify textual documents in languages with similar morphology, such as Indonesian and Malay, is still scarce. Hence, the aim of this study is to develop an integrated generic TC algorithm which is able to identify the language and then classify the category for identified news documents. Furthermore, a top-n feature selection method is utilized to improve TC performance and to overcome the online news corpora classification challenges: rapid data growth of online news documents and high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from the years 2014 to 2015. The classification method is shown to produce good results, with an accuracy rate of up to 95.63% for language identification and 97.5% for category classification. The category classifier works optimally at n = 60%, with an average computational time of 35 seconds. This highlights that the integrated generic TC has an advantage over manual classification and is suitable for Indonesian and Malay news classification.


2019 ◽  
Vol 9 (3) ◽  
pp. 437 ◽  
Author(s):  
Shen Su ◽  
Yanbin Sun ◽  
Xiangsong Gao ◽  
Jing Qiu ◽  
Zhihong Tian

Selecting the right features for further data analysis is important in the process of equipment anomaly detection, especially when the original data source involves high-dimensional data with a low value density. However, existing research fails to capture the fact that sensor data are usually correlated (e.g., duplicated deployed sensors), and that the correlations are broken when anomalies happen to the monitored equipment. In this paper, we propose to capture such sensor data correlation changes to improve the performance of IoT (Internet of Things) equipment anomaly detection. In our feature selection method, we first cluster correlated sensors together to recognize the duplicated deployed sensors according to sensor data correlations, and we monitor the data correlation changes in real time to select the sensors with correlation changes as the representative features for anomaly detection. To that end, (1) we conduct curve alignment for the sensor clustering; (2) we discuss the appropriate window size for data correlation calculation; and (3) we adopt MCFS (Multi-Cluster Feature Selection) into our method to adapt to the online feature selection scenario. According to the experimental evaluation on data derived from real IoT equipment, we show that our method manages to reduce the false negatives of IoT equipment anomaly detection by 30% with almost the same level of false positives.
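The idea of watching for broken correlations between normally correlated sensors can be illustrated with a windowed Pearson correlation. This is a minimal sketch only: it assumes two already aligned series, non-overlapping windows, and a fixed hypothetical `threshold`, and it does not implement the curve alignment, sensor clustering, or MCFS steps named in the abstract.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_breaks(x, y, window, threshold=0.5):
    """Return start indices of non-overlapping windows where correlation < threshold.

    Assumption: the two sensors are expected to be strongly positively
    correlated, so a low windowed correlation signals a possible anomaly.
    """
    breaks = []
    for start in range(0, len(x) - window + 1, window):
        r = pearson(x[start:start + window], y[start:start + window])
        if r < threshold:
            breaks.append(start)
    return breaks
```

The window size trades responsiveness against noise: short windows react quickly but give unstable correlation estimates, which is presumably why the abstract treats window size as a question of its own.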


2017 ◽  
Vol 2017 ◽  
pp. 1-10 ◽  
Author(s):  
Khairan D. Rajab

Phishing is a serious web threat that involves mimicking authenticated websites to deceive users in order to obtain their financial information. Phishing has caused financial damage to various online stakeholders, on the order of hundreds of millions; hence it is essential to minimize this risk. Classifying websites into “phishy” and legitimate types is a primary task in data mining that security experts and decision makers are hoping to improve, particularly with respect to the detection rate and reliability of the results. One way to ensure the reliability of the results and to enhance performance is to identify a set of related features early on, so that the data dimensionality is reduced and irrelevant features are discarded. To increase the reliability of preprocessing, this article proposes a new feature selection method that combines the scores of multiple known methods to minimize discrepancies in feature selection results. The proposed method has been applied to the problem of website phishing classification to show its pros and cons in identifying relevant features. Results against a security dataset reveal that the proposed preprocessing method was able to derive new feature datasets which, when mined, generate highly competitive classifiers with respect to detection rate when compared to results obtained from other feature selection methods.

