Chinese Sentiment Classifier Machine Learning Based on Optimized Information Gain Feature Selection

2014 ◽  
Vol 988 ◽  
pp. 511-516 ◽  
Author(s):  
Jin Tao Shi ◽  
Hui Liang Liu ◽  
Yuan Xu ◽  
Jun Feng Yan ◽  
Jian Feng Xu

Machine learning is an important approach to Chinese text sentiment categorization, and text feature selection is critical to classification performance. However, classical feature selection methods work well across the global category set, yet they miss many representative feature words of individual categories. This paper presents an improved information gain method that integrates word frequency and the sentiment degree of feature words into the traditional information gain measure. Experiments show that a classifier built on this method achieves better classification performance.
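The exact weighting scheme is not given in the abstract, so the following is only a minimal sketch of the idea: compute standard information gain per term and then scale it by relative term frequency and an assumed per-term sentiment-degree weight. The toy corpus, the `weighted_ig` function, and the lexicon weights are illustrative, not the authors' formulation.

```python
import math
from collections import Counter

def information_gain(docs, labels, vocab):
    """Standard IG of each term w.r.t. the class labels (term presence/absence)."""
    n = len(docs)
    label_counts = Counter(labels)
    h_class = -sum(c / n * math.log2(c / n) for c in label_counts.values())
    scores = {}
    for term in vocab:
        present = [lbl for doc, lbl in zip(docs, labels) if term in doc]
        absent = [lbl for doc, lbl in zip(docs, labels) if term not in doc]
        h_cond = 0.0
        for subset in (present, absent):
            if not subset:
                continue
            p_subset = len(subset) / n
            counts = Counter(subset)
            h_cond += p_subset * -sum(
                c / len(subset) * math.log2(c / len(subset)) for c in counts.values())
        scores[term] = h_class - h_cond
    return scores

def weighted_ig(docs, labels, vocab, sentiment_degree):
    """Assumed weighting: scale IG by relative term frequency and a
    per-term sentiment-degree weight (e.g. from a sentiment lexicon)."""
    ig = information_gain(docs, labels, vocab)
    total_tokens = sum(len(d) for d in docs)
    tf = Counter(tok for d in docs for tok in d)
    return {t: ig[t] * (tf[t] / total_tokens) * sentiment_degree.get(t, 1.0)
            for t in vocab}

# toy usage on a tokenized corpus
docs = [["好", "喜欢"], ["差", "讨厌"], ["好", "推荐"], ["差", "失望"]]
labels = ["pos", "neg", "pos", "neg"]
vocab = {tok for d in docs for tok in d}
degree = {"好": 1.5, "差": 1.5}          # assumed lexicon weights
ranked = sorted(weighted_ig(docs, labels, vocab, degree).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[:5])
```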

2017 ◽  
Vol 24 (1) ◽  
pp. 3-37 ◽  
Author(s):  
SANDRA KÜBLER ◽  
CAN LIU ◽  
ZEESHAN ALI SAYYED

We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
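A minimal sketch of the two ideas above, not the authors' pipeline: score word n-gram features with a multi-class information-gain criterion computed once over all rating classes (approximated here by scikit-learn's mutual_info_classif) rather than combining one-vs-rest binary scores, and over-sample minority rating classes by simple resampling. The toy reviews, ratings, and k are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.utils import resample

reviews = ["great recipe, came out perfectly",
           "too salty, would not make again",
           "decent but needed more garlic",
           "absolutely delicious, family favourite"]
ratings = [4, 1, 3, 4]                      # multi-class targets (e.g. fork ratings)

vec = CountVectorizer(ngram_range=(1, 2))   # word unigrams + bigrams
X = vec.fit_transform(reviews)

# multi-class information gain computed over all classes at once
selector = SelectKBest(mutual_info_classif, k=10)
X_sel = selector.fit_transform(X, ratings)

# naive over-sampling: resample each minority class up to the majority size
X_arr, y_arr = X_sel.toarray(), np.array(ratings)
majority = max(np.bincount(y_arr)[1:])      # ignore unused rating 0
parts = []
for cls in np.unique(y_arr):
    Xc, yc = X_arr[y_arr == cls], y_arr[y_arr == cls]
    Xc, yc = resample(Xc, yc, replace=True, n_samples=majority, random_state=0)
    parts.append((Xc, yc))
X_bal = np.vstack([p[0] for p in parts])
y_bal = np.concatenate([p[1] for p in parts])
print(X_bal.shape, np.bincount(y_bal))
```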


Author(s):  
Mohsin Iqbal ◽  
Saif Ur Rehman ◽  
Saira Gillani ◽  
Sohail Asghar

The key objective of this chapter is to study classification accuracy when feature selection is combined with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how an integrated feature selection step affects the accuracy of three classifiers. The filter results show that Information Gain (IG), Gain Ratio (GR), and Relief-f, and the wrapper results show that Bagging and Naive Bayes (NB), enabled the classifiers to achieve the largest average increase in classification accuracy while reducing the number of unnecessary attributes. These conclusions can guide machine learning users in choosing which classifier and feature selection methods to use to optimize classification accuracy. This is especially important in risk-sensitive applications of machine learning, where one aim is to reduce the cost of collecting, processing, and storing unnecessary data.
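A minimal sketch of that kind of comparison, not the chapter's exact protocol: measure cross-validated accuracy of two classifiers with and without a filter-style feature selection step (information gain approximated here by mutual information). The dataset and k are illustrative, and Gain Ratio and Relief-f have no drop-in scikit-learn scorer, so only the IG-style filter is shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

classifiers = {"NaiveBayes": GaussianNB(),
               "Bagging": BaggingClassifier(random_state=0)}

for name, clf in classifiers.items():
    # baseline: all attributes
    baseline = cross_val_score(clf, X, y, cv=5).mean()
    # integrated feature selection: keep the 10 highest-scoring attributes
    with_fs = cross_val_score(
        make_pipeline(SelectKBest(mutual_info_classif, k=10), clf),
        X, y, cv=5).mean()
    print(f"{name}: all features {baseline:.3f}  vs  top-10 IG {with_fs:.3f}")
```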


2021 ◽  
Vol 13 (14) ◽  
pp. 2833
Author(s):  
Xing Wei ◽  
Marcela A. Johnson ◽  
David B. Langston ◽  
Hillary L. Mehl ◽  
Song Li

Hyperspectral sensors combined with machine learning are increasingly utilized in agricultural crop systems for diverse applications, including plant disease detection. This study was designed to identify the most important wavelengths to discriminate between healthy and diseased peanut (Arachis hypogaea L.) plants infected with Athelia rolfsii, the causal agent of peanut stem rot, using in situ spectroscopy and machine learning. In greenhouse experiments, daily measurements were conducted to inspect disease symptoms visually and to collect spectral reflectance of peanut leaves on lateral stems of plants mock-inoculated and inoculated with A. rolfsii. Spectrum files were categorized into five classes based on foliar wilting symptoms. Five feature selection methods were compared to select the top 10 ranked wavelengths with and without a custom minimum distance of 20 nm. Recursive feature elimination methods outperformed the chi-square and SelectFromModel methods. Enforcing the 20 nm minimum distance among the top selected wavelengths improved classification performance. Wavelengths of 501–505, 690–694, 763, and 884 nm were repeatedly selected by two or more feature selection methods. These selected wavelengths can be applied in designing optical sensors for automated stem rot detection in peanut fields. The machine-learning-based methodology can be adapted to identify spectral signatures of disease in other plant-pathogen systems.
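A minimal sketch of the spacing idea described above, not the authors' code: rank wavelengths with SVM-based recursive feature elimination, then greedily keep the top-ranked bands that are at least 20 nm apart. The reflectance spectra, band grid, and the `top_k_with_spacing` helper are illustrative placeholders.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
wavelengths = np.arange(400, 1000, 2)            # 2 nm bands, 400-998 nm
X = rng.random((120, wavelengths.size))          # placeholder reflectance spectra
y = np.tile(np.arange(5), 24)                    # five wilting-symptom classes

# RFE with a linear SVM assigns every band a rank (1 = most important)
rfe = RFE(SVC(kernel="linear"), n_features_to_select=1, step=10).fit(X, y)
order = np.argsort(rfe.ranking_)                 # band indices from best to worst

def top_k_with_spacing(order, wavelengths, k=10, min_gap=20.0):
    """Greedy selection of the k best-ranked bands at least min_gap nm apart."""
    chosen = []
    for idx in order:
        if all(abs(wavelengths[idx] - wavelengths[j]) >= min_gap for j in chosen):
            chosen.append(idx)
        if len(chosen) == k:
            break
    return sorted(wavelengths[chosen])

print(top_k_with_spacing(order, wavelengths))
```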


2021 ◽  
Vol 15 ◽  
Author(s):  
Chenggang Song ◽  
Weidong Zhao ◽  
Hong Jiang ◽  
Xiaoju Liu ◽  
Yumei Duan ◽  
...  

Structural MRI (sMRI) has been widely used to examine the cerebral changes that occur in Parkinson's disease (PD). However, previous studies have examined brain changes at the group level rather than at the individual level, and they have been inconsistent regarding the changes they identified. It is therefore difficult to determine which brain regions are true biomarkers of PD. To overcome these two issues, we employed four different feature selection methods [ReliefF, graph theory, recursive feature elimination (RFE), and stability selection] to obtain a minimal set of relevant, nonredundant features from gray matter (GM) and white matter (WM). A support vector machine (SVM) was then used to learn decision models from the selected features. Using machine learning, this study not only extends group-level statistical analysis, which identifies group differences, to the individual level by distinguishing patients with PD from healthy controls (HCs), but also identifies the most informative brain regions via feature selection. Furthermore, we conducted horizontal and vertical analyses to investigate the stability of the identified brain regions. On the one hand, we compared the brain changes found by the different feature selection methods and regarded the regions identified in common as potential biomarkers of PD. On the other hand, we compared these brain changes with previous findings reported by conventional statistical analysis to evaluate their stability. Our experiments demonstrate that the proposed machine learning techniques achieve satisfactory and robust classification performance. The highest classification performance was 92.24% (specificity), 92.42% (sensitivity), 89.58% (accuracy), and 89.77% (AUC) for GM, and 71.93% (specificity), 74.87% (sensitivity), 71.18% (accuracy), and 71.82% (AUC) for WM. Moreover, most brain regions identified by machine learning were consistent with previous findings, which indicates that these regions are related to the pathological brain changes characteristic of PD and can be regarded as potential biomarkers of PD. We also found abnormalities of the dorsolateral superior frontal gyrus (SFGdor) and lingual gyrus (LING), which have been confirmed in other studies of PD. This further demonstrates that machine learning models can serve clinicians as a decision support system in diagnosing PD.
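A minimal sketch of one arm of the pipeline described above (RFE plus a linear SVM on regional features, evaluated with the same four metrics), not the authors' full four-method comparison. The feature matrix here is simulated; in practice each row would be a subject and each column a regional GM or WM measure.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 90))                 # 100 subjects x 90 brain regions (placeholder)
y = np.repeat([0, 1], 50)                      # 0 = healthy control, 1 = PD

model = make_pipeline(
    StandardScaler(),
    RFE(SVC(kernel="linear"), n_features_to_select=20),   # keep 20 regions
    SVC(kernel="linear", probability=True),
)

y_true, y_pred, y_prob = [], [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in cv.split(X, y):
    model.fit(X[train], y[train])
    y_true.extend(y[test])
    y_pred.extend(model.predict(X[test]))
    y_prob.extend(model.predict_proba(X[test])[:, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"sensitivity={tp/(tp+fn):.3f} specificity={tn/(tn+fp):.3f} "
      f"accuracy={(tp+tn)/len(y_true):.3f} AUC={roc_auc_score(y_true, y_prob):.3f}")
```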


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Monalisa Ghosh ◽  
Goutam Sanyal

Sentiment classification, or sentiment analysis, is acknowledged as an open research domain, and in recent years an enormous amount of research has been carried out in this field using a wide range of methodologies. Feature generation and selection are consequential for text mining, as a high-dimensional feature set can degrade the performance of sentiment analysis. This paper investigates the effectiveness of the widely used feature selection methods (IG, Chi-square, and Gini Index) with unigram and bigram feature sets on four machine learning classification algorithms (MNB, SVM, KNN, and ME). The proposed methods are evaluated on three standard datasets, namely, the IMDb movie review dataset and the electronics and kitchen product review datasets. Initially, unigram and bigram features are extracted by applying the n-gram method. In addition, we generate a composite feature vector, CompUniBi (unigram + bigram), which is fed to the feature selection methods Information Gain (IG), Gini Index (GI), and Chi-square (CHI) to obtain an optimal feature subset by assigning a score to each feature. These methods rank the features by their scores, so a prominent feature vector (CompIG, CompGI, or CompCHI) can easily be generated for classification. Finally, the machine learning classifiers SVM, MNB, KNN, and ME use the prominent feature vector to classify each review document as either positive or negative. Performance is measured by precision, recall, and F-measure. Experimental results show that the composite feature vector achieved better performance than unigram features, which is encouraging as well as comparable to related research. The best results, in terms of highest accuracy, were obtained from the combination of Information Gain with SVM.
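A minimal sketch of the CompUniBi-to-CompCHI path described above, not the paper's full experiment: build a combined unigram+bigram vector, keep the top-scoring features by chi-square, and classify with MNB and a linear SVM. The toy reviews and the value of k are placeholders for the IMDb and product-review corpora.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reviews = ["the movie was wonderful and moving",
           "terrible plot and wooden acting",
           "the blender works great and is quiet",
           "stopped working after two days, awful"]
labels = [1, 0, 1, 0]                      # 1 = positive, 0 = negative

for name, clf in [("MNB", MultinomialNB()), ("SVM", LinearSVC())]:
    pipe = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # CompUniBi: unigrams + bigrams
        SelectKBest(chi2, k=10),               # CompCHI: top-k chi-square features
        clf,
    )
    acc = cross_val_score(pipe, reviews, labels, cv=2).mean()
    print(f"{name}: cv accuracy {acc:.2f}")
```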

