Sentiment Analysis of Chinese Micro Blog Using Machine Learning and an Improved Feature Selection Method

2014 ◽  
Vol 631-632 ◽  
pp. 1219-1223
Author(s):  
Jia Hao Chen ◽  
Jian Hua Wu

With the rapid development of Internet and occurrence of social media services, many users are becoming the creators of social information. However, the normal manual work can't deal with a large number of subjective messages. As a new kind of social media service, micro blog has been widely accepted and can be used for sentiment analysis. This paper compared performances of three machine learning methods on sentiment analysis of Chinese micro blog. We also proposed an improved feature selection method that increases the accuracy of classification. Experiment results show that SVM is closed to Naïve Bayes and they are better than logistic regression in most cases.

2020 ◽  
Vol 4 (1) ◽  
pp. 29
Author(s):  
Sasan Sarbast Abdulkhaliq ◽  
Aso Mohammad Darwesh

Nowadays, people from every part of the world use social media and social networks to express their feelings toward different topics and aspects. One of the trendiest social media is Twitter, which is a microblogging website that provides a platform for its users to share their views and feelings about products, services, events, etc., in public. Which makes Twitter one of the most valuable sources for collecting and analyzing data by researchers and developers to reveal people sentiment about different topics and services, such as products of commercial companies, services, well-known people such as politicians and athletes, through classifying those sentiments into positive and negative. Classification of people sentiment could be automated through using machine learning algorithms and could be enhanced through using appropriate feature selection methods. We collected most recent tweets about (Amazon, Trump, Chelsea FC, CR7) using Twitter-Application Programming Interface and assigned sentiment score using lexicon rule-based approach, then proposed a machine learning model to improve classification accuracy through using hybrid feature selection method, namely, filter-based feature selection method Chi-square (Chi-2) plus wrapper-based binary coordinate ascent (Chi-2 + BCA) to select optimal subset of features from term frequency-inverse document frequency (TF-IDF) generated features for classification through support vector machine (SVM), and Bag of words generated features for logistic regression (LR) classifiers using different n-gram ranges. After comparing the hybrid (Chi-2+BCA) method with (Chi-2) selected features, and also with the classifiers without feature subset selection, results show that the hybrid feature selection method increases classification accuracy in all cases. The maximum attained accuracy with LR is 86.55% using (1 + 2 + 3-g) range, with SVM is 85.575% using the unigram range, both in the CR7 dataset.


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.


2020 ◽  
Vol 22 (Supplement_2) ◽  
pp. ii158-ii158
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
James Fink ◽  
David Haynor ◽  
Eric Holland ◽  
...  

Abstract BACKGROUND Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established. METHODS Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding. RESULTS We are able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another. CONCLUSIONS We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
Abdullah Feroze ◽  
Eric C Holland ◽  
Linda Shapiro ◽  
...  

Abstract Background Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principle component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.


Sign in / Sign up

Export Citation Format

Share Document