Building a Sentiment Analysis System Using Automatically Generated Training Dataset

Author(s):  
Daoud M. Daoud ◽  
Samir Abou El-Seoud

In this paper, we describe a methodology for automatically developing a large training set for sentiment analysis. We extract Arabic tweets and then annotate them for negative and positive sentiment without human intervention. These annotated tweets are used as a training data set to build our experimental sentiment analysis system using the Naive Bayes algorithm with TF-IDF enhancement. A large training set is necessary for a highly inflected language to compensate for the inherent data sparseness of such languages. We present our techniques and explain our experimental system. We use 200,000 annotated tweets to train our system. The evaluation shows that our sentiment analysis system achieves high precision and accuracy compared to existing ones.
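
The pipeline described (automatically labeled tweets feeding a Naive Bayes classifier over TF-IDF features) can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation; in particular, the emoticon-based labeling rule is a hypothetical stand-in for however the annotation without human intervention was actually performed.

```python
# Minimal sketch (not the authors' code): auto-labeled tweets -> TF-IDF -> Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def auto_label(tweet):
    """Hypothetical distant-supervision rule: emoticons act as noisy labels."""
    if any(e in tweet for e in (":)", ":-)")):
        return "positive"
    if any(e in tweet for e in (":(", ":-(")):
        return "negative"
    return None  # discard tweets with no clear signal

raw_tweets = ["خدمة ممتازة :)", "تجربة سيئة :("]  # toy examples
labeled = [(t, auto_label(t)) for t in raw_tweets]
texts, labels = zip(*[(t, y) for t, y in labeled if y is not None])

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["منتج رائع :)"]))
```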

2018 ◽  
Vol 34 (3) ◽  
pp. 569-581 ◽  
Author(s):  
Sujata Rani ◽  
Parteek Kumar

Abstract In this article, an innovative approach to sentiment analysis (SA) is presented. The proposed system handles Romanized or abbreviated text and spelling variations when performing sentiment analysis. The training data set of 3,000 movie reviews and tweets was manually labeled by native speakers of Hindi into three classes, i.e. positive, negative, and neutral. The system uses the WEKA (Waikato Environment for Knowledge Analysis) tool to convert the string data into numerical matrices and applies three machine learning techniques, i.e. Naive Bayes (NB), J48, and support vector machine (SVM). The proposed system was tested on 100 movie reviews and tweets, and it was observed that SVM performed best in comparison to the other classifiers, with an accuracy of 68% for movie reviews and 82% for tweets. The results of the proposed system are very promising and can be used in emerging applications like SA of product reviews and social media analysis. Additionally, the proposed system can offer other cultural and social benefits, such as predicting or helping to counter riots.
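
For illustration only, the WEKA workflow (string data converted to numerical matrices, then three classifiers compared) has a close analogue in Python with scikit-learn; DecisionTreeClassifier is used below as a rough stand-in for WEKA's J48 (C4.5), and the toy texts and labels are invented.

```python
# Illustrative analogue of the WEKA workflow: text -> numerical matrix -> NB / tree / SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier  # rough analogue of WEKA's J48 (C4.5)
from sklearn.pipeline import make_pipeline

texts = ["film bahut achhi thi", "bakwas movie", "theek thaak hi thi"]  # toy Romanized Hindi
labels = ["positive", "negative", "neutral"]

for clf in (MultinomialNB(), DecisionTreeClassifier(), LinearSVC()):
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["movie achhi thi"]))
```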


2014 ◽  
Vol 539 ◽  
pp. 181-184
Author(s):  
Wan Li Zuo ◽  
Zhi Yan Wang ◽  
Ning Ma ◽  
Hong Liang

Accurate classification of text is a basic premise for efficiently extracting various types of information from the Web and utilizing network resources properly. In this paper, a new text classification method is proposed. Consistency analysis is a type of iterative algorithm that trains different classifiers (weak classifiers) on the same training set and then gathers these classifiers to test the consistency degrees of the various classification methods on the same text, thus bringing out the knowledge of each type of classifier. It determines the weight of each sample according to whether that sample was classified accurately in each training round, as well as the accuracy of the last overall classification, and then passes the reweighted data set to the subordinate classifier for training. In the end, the classifiers gained in training are integrated into the final decision classifier. A classifier with consistency analysis can eliminate unnecessary training data characteristics and focus on the key training data. According to the experimental results, the average accuracy of this method is 91.0%, while the average recall rate is 88.1%.
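
The reweighting loop described here, where misclassified samples gain weight before the next weak classifier is trained and the classifiers are finally combined, follows the general boosting template. The sketch below is one reading of that procedure in the style of AdaBoost, not the authors' exact algorithm; for text classification, X would typically be a TF-IDF matrix.

```python
# Boosting-style sketch of iterative sample reweighting (not the authors' algorithm).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, rounds=10):
    """X: feature matrix (e.g. TF-IDF); y: np.ndarray of -1/+1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        weak = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = weak.predict(X)
        err = w[pred != y].sum()               # weighted error of this round
        if err == 0 or err >= 0.5:             # degenerate weak classifier: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight from accuracy
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, weak))
    return ensemble

def predict(ensemble, X):
    votes = sum(alpha * weak.predict(X) for alpha, weak in ensemble)
    return np.where(votes >= 0, 1, -1)         # integrated final decision classifier
```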


2020 ◽  
Vol 10 (6) ◽  
pp. 2104
Author(s):  
Michał Tomaszewski ◽  
Paweł Michalski ◽  
Jakub Osuchowski

This article presents an analysis of the effectiveness of object detection in digital images when only a limited quantity of input data is available. The possibility of using a limited learning set was achieved by developing a detailed scenario of the task, which strictly defined the operating conditions of the detector for the considered case of a convolutional neural network. The described solution utilizes known deep neural network architectures for learning and object detection. The article compares the detection results of the most popular deep neural networks trained on a limited set composed of a specific number of images selected from diagnostic video. The analyzed input material was recorded during an inspection flight conducted along high-voltage lines, and the object detector was built for a power insulator. The main contribution of the presented paper is the evidence that a limited training set (in our case, just 60 training frames) can be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. Deciding which network will generate the best result for such a limited training set is not a trivial task. The conducted research suggests that deep neural networks achieve different levels of effectiveness depending on the amount of training data. The best results were obtained for two convolutional neural networks: the faster region-based convolutional neural network (Faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision), at a level of 0.8 for 60 frames. The R-FCN model achieved a lower AP; however, its results were noticeably less sensitive to the number of input samples than those of the other CNN models, which, in the authors' assessment, is a desirable feature when the training set is limited.
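
For readers who want to reproduce the general setup, fine-tuning a pretrained detector on a small custom set is straightforward in torchvision. The sketch below shows only the model side for a hypothetical one-class ("insulator") task; the data loading is omitted and none of this is the authors' code.

```python
# Sketch: adapt a pretrained Faster R-CNN to one custom class plus background.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # background + power insulator
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the box classification head so it predicts our two classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
# Training loop over the ~60 annotated frames (data_loader yields images and
# targets containing "boxes" and "labels" tensors):
# for images, targets in data_loader:
#     loss = sum(model(images, targets).values())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```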


Author(s):  
Anastasia V. Kolmogorova

The article aims to analyze the validity of Internet confession texts used as a source of training data for designing a computer classifier of Russian Internet texts according to their emotional tonality. The classifier, backed by Lövheim's emotional cube model, is expected to detect eight classes of emotions represented in a text or to assign the text to an emotionally neutral class. The first, and one of the most important, stages of classifier creation is the selection of the training data set. The training data set in machine learning is the actual data set used to train the model to perform various actions. The Internet text genres traditionally used in sentiment analysis to train two- or three-tonality classifiers are tweets, film and market reviews, blogs, and financial reports. The novelty of our project consists in designing a multiclass classifier, which requires new, non-trivial training data. As such, we have chosen texts from the public group Overheard in the Russian social network VKontakte. As all the texts show similarities, we united them under the genre name "Internet confession". To characterize the genre, we applied the method of narrative semiotics, describing six positions that form the deep narrative structure of the "Internet confession": Addresser – a person aware of her/his separateness from society; Addressee – society / public opinion; Subject – a narrator describing his/her emotional state; Object – the person's self-image; Helper – the person's frankness; Adversary – the person's shame. The above-mentioned genre features determine its primary, qualitative advantage: a particular focus on emotionality, whereas more traditional sources of textual data are based on such categories as expressivity (tweets) or axiological estimations (all sorts of reviews). The structural analysis of the texts under discussion has also demonstrated several advantages stemming from the technological basis of the Overheard project: the text hashtagging spares the researcher from submitting the whole collection to crowdsourcing assessment; its size is optimal for assessment by experts; and, despite their hyperbolized emotionality, the texts of the Internet confession genre share the stylistic features typical of different types of personal Internet discourse. However, the narrative character of all Internet confession texts implies some restrictions on their use within a sentiment analysis project.


2016 ◽  
Vol 2016 (4) ◽  
pp. 21-36 ◽  
Author(s):  
Tao Wang ◽  
Ian Goldberg

Abstract Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.
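
As a toy illustration of the classification step in website fingerprinting (not the authors' system), one can summarize each packet sequence with a few coarse features (packet counts, directions, total volume) and feed them to a nearest-neighbor classifier; real attacks use far richer feature sets.

```python
# Toy website-fingerprinting classifier over packet traces (illustrative only).
# A trace is a list of signed packet sizes: +size = outgoing, -size = incoming.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def features(trace):
    sizes = np.array(trace)
    return [
        len(sizes),                 # total packet count
        int((sizes > 0).sum()),     # outgoing packets
        int((sizes < 0).sum()),     # incoming packets
        int(np.abs(sizes).sum()),   # total bytes transferred
    ]

# Hypothetical labeled traces, one label per monitored page.
traces = [[+500, -1500, -1500], [+300, -1500, +500, -1500, -1500]]
labels = ["page_a", "page_b"]

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit([features(t) for t in traces], labels)
print(clf.predict([features([+500, -1500, -1400])]))
```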


Dose-Response ◽  
2019 ◽  
Vol 17 (4) ◽  
pp. 155932581989417 ◽  
Author(s):  
Zhi Huang ◽  
Jie Liu ◽  
Liang Luo ◽  
Pan Sheng ◽  
Biao Wang ◽  
...  

Background: Plenty of evidence has suggested that autophagy plays a crucial role in the biological processes of cancers. This study aimed to screen autophagy-related genes (ARGs) and establish a novel scoring system for colorectal cancer (CRC). Methods: ARG sequencing data and the corresponding clinical data of CRC in The Cancer Genome Atlas were used as the training data set. The GSE39582 data set from the Gene Expression Omnibus was used as the validation set. An autophagy-related signature was developed in the training set using univariate Cox analysis followed by stepwise multivariate Cox analysis and assessed in the validation set. We then analyzed the functions and pathways of the ARGs using the Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Finally, a prognostic nomogram combining the autophagy-related risk score and clinicopathological characteristics was developed based on the multivariate Cox analysis. Results: After univariate and multivariate analysis, 3 ARGs were used to construct the autophagy-related signature. The KEGG pathway analyses showed several significantly enriched oncological signatures, such as the p53 signaling pathway, apoptosis, human cytomegalovirus infection, platinum drug resistance, necroptosis, and the ErbB signaling pathway. Patients were divided into high- and low-risk groups, and high-risk patients had significantly shorter overall survival (OS) than low-risk patients in both the training set and the validation set. Furthermore, a nomogram for predicting 3- and 5-year OS was established based on the autophagy-based risk score and clinicopathologic factors. The area under the curve and calibration curves indicated that the nomogram predicted with good accuracy. Conclusions: Our proposed autophagy-based signature has important prognostic value and may provide a promising tool for the development of personalized therapy.
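
The two-stage modeling described, univariate Cox screening per gene followed by multivariate Cox on the survivors, can be sketched with the Python lifelines package. The column names, input file, and 0.05 cutoff below are assumptions, and a plain multivariate fit stands in for the stepwise selection the authors used.

```python
# Sketch: univariate Cox screening, then a multivariate Cox risk score (lifelines).
import pandas as pd
from lifelines import CoxPHFitter

# df: one row per patient with "time", "event", and one column per ARG expression.
df = pd.read_csv("tcga_crc_args.csv")  # placeholder file name
genes = [c for c in df.columns if c not in ("time", "event")]

# Stage 1: keep genes significant in univariate Cox regression.
selected = []
for g in genes:
    cph = CoxPHFitter().fit(df[["time", "event", g]], "time", "event")
    if cph.summary.loc[g, "p"] < 0.05:
        selected.append(g)

# Stage 2: multivariate Cox on the selected genes; the linear predictor is the risk score.
cph = CoxPHFitter().fit(df[["time", "event"] + selected], "time", "event")
risk_score = cph.predict_partial_hazard(df)
high_risk = risk_score > risk_score.median()  # split into high- and low-risk groups
```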


2020 ◽  
Vol 8 (4) ◽  
pp. 47-62
Author(s):  
Francisca Oladipo ◽  
Ogunsanya, F. B ◽  
Musa, A. E. ◽  
Ogbuju, E. E ◽  
Ariwa, E.

The social media space has evolved into a vast labyrinth of information exchange, and owing to the growing adoption of different social media platforms, there has been an increasing wave of interest in sentiment analysis as a paradigm for mining and analyzing users' opinions and sentiments based on their posts. In this paper, we present a review of contextual sentiment analysis of social media entries, with a specific focus on Twitter. Sentiment analysis comprises two broad approaches: the machine learning approach, which uses classification techniques to classify text and is further categorized into supervised and unsupervised learning; and the lexicon-based approach, which uses a dictionary and, unlike the machine learning approach, requires no test or training data set.
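
To make the contrast concrete: a lexicon-based scorer needs no training data at all. The sketch below uses NLTK's VADER lexicon purely as one well-known example of a dictionary-based approach.

```python
# Lexicon-based sentiment scoring: a dictionary, no training or test set.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this phone, but the battery is terrible."))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```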


2019 ◽  
Vol 1 ◽  
pp. 1-1
Author(s):  
Tee-Ann Teo

Abstract. Deep learning is a kind of machine learning technology that utilizes a deep neural network to learn a promising model from a large training data set. The convolutional neural network (CNN) has been successfully applied in image segmentation and classification with highly accurate results. A CNN applies multiple kernels (also called filters) to extract image features via image convolution, and it is able to determine multiscale features through multiple layers of convolution and pooling. The variety of the training data plays an important role in determining a reliable CNN model. Benchmark training data for road mark extraction, for example the KITTI Vision Benchmark Suite, mainly focus on close-range imagery because a close-range image is easier to obtain than an airborne image. This study aims to transfer road mark training data from a mobile lidar system to aerial orthoimages in fully convolutional networks (FCN). Transferring training data from a ground-based system to an airborne system may reduce the effort of producing a large training data set.

This study uses FCN technology and aerial orthoimages to localize road marks on road regions. The road regions are first extracted from a 2-D large-scale vector map. The input aerial orthoimage has a spatial resolution of 10 cm, and the non-road regions are masked out before road mark localization. The training data are road mark polygons, originally digitized from ground-based mobile lidar and prepared for road mark extraction using a mobile mapping system. This study reuses these training data and applies them to road mark extraction in aerial orthoimages. The digitized training road marks are transformed to road polygons based on mapping coordinates. As the detail of ground-based lidar is much better than that of the airborne system, partially occluded parking lots in the aerial orthoimage can also be obtained from the ground-based system. The labels (also called annotations) for the FCN include road region, non-road region, and road mark. The size of a training batch is 500 pixels by 500 pixels (50 m by 50 m on the ground), and the total number of training batches is 75. After the FCN training stage, an independent aerial orthoimage (Figure 1a) is used to predict the road marks. The FCN results provide initial regions for road marks (Figure 1b). Road marks usually show higher reflectance than road asphalt; therefore, this study uses this characteristic to refine the road marks (Figure 1c) by a binary classification inside each initial road mark region.

Comparing the automatically extracted road marks (Figure 1c) with manually digitized road marks (Figure 1d), most road marks can be extracted using the training set from the ground-based system. This study also selected an area of 600 m × 200 m for quantitative analysis. Among the 371 reference road marks, 332 could be extracted by the proposed scheme, and the completeness reached 89%. The preliminary experiment demonstrated that most road marks can be successfully extracted by the proposed scheme. Therefore, training data from a ground-based mapping system can be utilized for airborne orthoimages of similar spatial resolution.
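
The refinement step, a binary classification inside each initial road-mark region that exploits the higher reflectance of road marks, could plausibly be implemented as simple intensity thresholding; the sketch below uses Otsu's method from scikit-image and is an assumption about one possible realization, not the author's code.

```python
# Sketch: refine an FCN-detected road-mark region by brightness thresholding (Otsu).
import numpy as np
from skimage.filters import threshold_otsu

def refine_road_mark(gray_image, region_mask):
    """gray_image: 2-D intensity array; region_mask: boolean mask of one initial region."""
    t = threshold_otsu(gray_image[region_mask])  # split bright marks from dark asphalt
    refined = np.zeros_like(region_mask)
    refined[region_mask] = gray_image[region_mask] > t
    return refined
```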


2017 ◽  
Vol 2 (3) ◽  
pp. 87-91 ◽  
Author(s):  
Alia Karim Abdul Hassan ◽  
Ahmed Bahaa Aldeen Abdulwahhab

Recommender systems are nowadays used to deliver services and information to users. A recommender system suffers from data sparsity and cold-start problems because of insufficient user ratings or the absence of data about users or items. This research proposes a sentiment analysis system that works on user reviews as an additional source of information to tackle the data sparsity problem. The sentiment analysis system is implemented using NLP techniques with machine learning to predict a user's rating from his or her review. The model was evaluated on the Yelp restaurant data set, the IMDB reviews data set, and the Arabic qaym.com restaurant reviews data set under various classification models, and the system was efficient in predicting ratings from reviews.
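
Predicting a numeric rating from review text is a standard supervised task. As a rough illustration (not the paper's model), ratings can be treated as classes and a linear classifier trained on TF-IDF features:

```python
# Illustrative rating prediction from review text (ratings treated as classes).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["Amazing food and service!", "Terrible, never again.", "It was okay."]
stars = [5, 1, 3]  # toy labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, stars)
print(model.predict(["Pretty good overall."]))
```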

