A cross-corpus study of subjectivity identification using unsupervised learning

2011 ◽  
Vol 18 (3) ◽  
pp. 375-397 ◽  
Author(s):  
DONG WANG ◽  
YANG LIU

Abstract
In this study, we investigate unsupervised generative learning methods for subjectivity detection across different domains. We create an initial training set using simple lexicon information and then evaluate two iterative learning methods, each with a naive Bayes base classifier, for learning from unannotated data. The first is self-training, which adds high-confidence instances to the training set in each iteration. The second is a calibrated EM (expectation-maximization) method, in which we calibrate the posterior probabilities from EM so that the class distribution resembles that of the real data. We evaluate both approaches on three domains: movie reviews, news articles, and meeting dialogues, and find that in some cases the unsupervised learning methods can achieve performance close to the fully supervised setup. We perform a thorough analysis of factors such as the self-labeling accuracy of the initial training set, the accuracy of the examples added during self-training, and the size of the initial training set under the different methods. Our experiments and analysis reveal inherent differences across domains and identify the factors that explain the models' behavior.
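
The self-training half of this setup can be sketched compactly. The lexicons, threshold, and document format below are illustrative stand-ins rather than the paper's actual resources; only the loop structure (seed labels from a lexicon, train naive Bayes, absorb high-confidence instances each iteration) follows the description above.

```python
import math
from collections import Counter

SUBJ_LEXICON = {"love", "hate", "great", "awful"}     # toy stand-in lexicons
OBJ_LEXICON = {"reported", "announced", "percent"}

def seed_label(tokens):
    """Lexicon rule for the initial training set; None means unlabeled."""
    s = sum(t in SUBJ_LEXICON for t in tokens)
    o = sum(t in OBJ_LEXICON for t in tokens)
    if s > o:
        return "subj"
    if o > s:
        return "obj"
    return None

def train_nb(labeled):
    """Multinomial naive Bayes counts with add-one smoothing."""
    word_counts = {"subj": Counter(), "obj": Counter()}
    class_counts = Counter()
    vocab = set()
    for tokens, y in labeled:
        class_counts[y] += 1
        word_counts[y].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def posterior(model, tokens):
    """Normalized class posteriors P(class | tokens)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    logp = {}
    for c in ("subj", "obj"):
        lp = math.log((class_counts[c] + 1) / (total + 2))
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[c][t] + 1) / denom)
        logp[c] = lp
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return {c: math.exp(v - m) / z for c, v in logp.items()}

def self_train(docs, iters=3, threshold=0.9):
    """Seed from the lexicon, then iteratively absorb confident instances."""
    labeled = [(d, y) for d in docs if (y := seed_label(d))]
    pool = [d for d in docs if seed_label(d) is None]
    for _ in range(iters):
        model = train_nb(labeled)
        confident, rest = [], []
        for d in pool:
            post = posterior(model, d)
            c = max(post, key=post.get)
            (confident if post[c] >= threshold else rest).append((d, c))
        labeled += confident
        pool = [d for d, _ in rest]
        if not confident:
            break
    return train_nb(labeled)
```

The calibrated-EM variant would replace the hard threshold with soft posteriors rescaled to match an assumed class distribution.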

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Nasser Assery ◽  
Yuan (Dorothy) Xiaohong ◽  
Qu Xiuli ◽  
Roy Kaushik ◽  
Sultan Almalki

Purpose
This study aims to propose an unsupervised learning model to evaluate the credibility of disaster-related Twitter data and to present a performance comparison with commonly used supervised machine learning models.

Design/methodology/approach
First, historical tweets on two recent hurricane events are collected via the Twitter API. A credibility scoring system is then implemented, in which tweet features are analyzed to assign each tweet a credibility score and a credibility label. After that, supervised machine learning classification is carried out with several classification algorithms, and their performances are compared.

Findings
The proposed unsupervised learning model could enhance emergency response by providing a fast way to determine the credibility of disaster-related tweets. Additionally, the comparison of the supervised classification models reveals that the Random Forest classifier performs significantly better than the SVM and Logistic Regression classifiers in classifying the credibility of disaster-related tweets.

Originality/value
In this paper, an unsupervised 10-point scoring model is proposed to evaluate tweets' credibility based on user-based and content-based features. This technique could be used to evaluate the credibility of disaster-related tweets about future hurricanes and has the potential to enhance emergency response during critical events. The comparative study also reveals which supervised learning methods are effective for evaluating the credibility of Twitter data.
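
A minimal sketch of what a 10-point, one-point-per-check scoring model could look like. The specific user-based and content-based features and the labeling threshold below are assumptions for illustration; the abstract does not list the paper's actual features.

```python
def credibility_score(tweet):
    """Hypothetical 10-point credibility score: one point per satisfied check.
    The feature list is illustrative, not the paper's exact set."""
    checks = [
        tweet.get("verified", False),                 # user-based features
        tweet.get("followers", 0) > 100,
        tweet.get("account_age_days", 0) > 365,
        tweet.get("has_profile_description", False),
        tweet.get("prior_tweets", 0) > 50,
        tweet.get("has_url", False),                  # content-based features
        tweet.get("has_media", False),
        not tweet.get("all_caps", False),
        tweet.get("retweets", 0) > 0,
        len(tweet.get("text", "")) > 40,
    ]
    return sum(checks)

def credibility_label(score, threshold=6):
    """Map the 0-10 score to a binary label (threshold is an assumption)."""
    return "credible" if score >= threshold else "not credible"
```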


Author(s):  
Cara Murphy ◽  
John Kerekes

The classification of trace chemical residues through active spectroscopic sensing is challenging due to the lack of physics-based models that can accurately predict spectra. To overcome this challenge, we leveraged the field of domain adaptation to translate data from the simulated to the measured domain for training a classifier. We developed the first 1D conditional generative adversarial network (GAN) to perform spectrum-to-spectrum translation of reflectance signatures. We applied the 1D conditional GAN to a library of simulated spectra and quantified the improvement in classification accuracy on real data using the translated spectra for training the classifier. Using the GAN-translated library, the average classification accuracy increased from 0.622 to 0.723 on real chemical reflectance data, including data from chemicals not included in the GAN training set.


2021 ◽  
Author(s):  
Ruxian Wang

The growth of market size is crucially important to firms, yet researchers often assume that market size is constant in assortment and pricing management. I develop a model that incorporates market expansion effects into discrete consumer choice models and investigate various operations problems. Market size, measured by the number of people who are interested in the products of a category, is strongly influenced by firms' operations strategies, and it in turn affects assortment planning and pricing decisions. Failing to account for market expansion effects may lead to substantial errors in demand estimation and losses in operations management. To calibrate the new model on real data, this paper uses an alternating-optimization expectation-maximization method that separates the estimation of consumer choice behavior from that of market expansion effects. The end-to-end approach to modeling, operations, and estimation is readily applicable in real business settings.
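
The paper's estimator is an EM method on real data and far richer than anything shown here. Purely to illustrate the alternating structure, the toy sketch below alternates a market-size step with a grid search over the parameters of a simple multinomial logit with an outside option; the model form, the grid, and all names are our simplifications, not the paper's method. (In this toy setup, market size and baseline attractiveness are confounded by aggregate sales alone, which is part of why richer structure is needed in practice.)

```python
import math

def mnl_probs(prices, a, b):
    """Multinomial-logit purchase probabilities with a no-purchase option."""
    w = [math.exp(a - b * p) for p in prices]
    z = 1.0 + sum(w)
    return [wi / z for wi in w]

def fit_alternating(prices, sales, iters=20):
    """Alternate (1) estimating market size given choice parameters with
    (2) grid-searching choice parameters given market size. Illustrative only."""
    a, b, n = 0.0, 0.1, sum(sales)                       # crude initialization
    for _ in range(iters):
        probs = mnl_probs(prices, a, b)
        n = sum(sales) / sum(probs)                      # step 1: market size
        best = (float("inf"), a, b)
        for a_try in [x / 10 for x in range(-20, 21)]:   # step 2: choice params
            for b_try in [x / 100 for x in range(1, 51)]:
                pr = mnl_probs(prices, a_try, b_try)
                err = sum((n * p - s) ** 2 for p, s in zip(pr, sales))
                if err < best[0]:
                    best = (err, a_try, b_try)
        _, a, b = best
    return a, b, n
```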


Author(s):  
Sook-Ling Chua ◽  
Stephen Marsland ◽  
Hans W. Guesgen

The problem of behaviour recognition based on data from sensors is essentially an inverse problem: given a set of sensor observations, identify the sequence of behaviours that gave rise to them. In a smart home, the behaviours are likely to be the standard human behaviours of daily living, and the observations will depend upon the sensors the house is equipped with. There are two main approaches to identifying behaviours from the sensor stream. One is a symbolic approach, which explicitly models the recognition process. The other is a sub-symbolic approach to behaviour recognition using data mining and machine learning methods, which is the focus of this chapter. While many machine learning methods have been used to identify behaviours from the sensor stream, they have generally relied upon a labelled dataset in which a person has manually recorded their behaviour at each time. This is particularly tedious, resulting in relatively small datasets, and is also prone to significant errors, as people do not correctly pinpoint the end of one behaviour and the commencement of the next. In this chapter, the authors consider methods for dealing with unlabelled sensor data for behaviour recognition and investigate their use. They then consider whether such methods are best used in isolation or as preprocessing to provide a training set for a supervised method.


2019 ◽  
Vol 7 (4) ◽  
pp. T911-T922
Author(s):  
Satyakee Sen ◽  
Sribharath Kainkaryam ◽  
Cen Ong ◽  
Arvind Sharma

Salt model building has long been considered a severe bottleneck for large-scale 3D seismic imaging projects. It is one of the most time-consuming, labor-intensive, and difficult-to-automate processes in the entire depth imaging workflow, requiring significant intervention by domain experts to manually interpret the salt bodies on noisy, low-frequency, and low-resolution seismic images at each iteration of the salt model building process. The difficulty of this task and the need to automate it are well recognized by the imaging community and have propelled the use of deep-learning-based convolutional neural network (CNN) architectures to carry it out. However, significant challenges remain for reliable production-scale deployment of CNN-based methods for salt model building, mainly because of the poor generalization capabilities of these networks: when used on new surveys never seen during training, the interpretation accuracy of these models drops significantly. To remediate this key problem, we have introduced a U-shaped encoder-decoder CNN architecture trained with a specialized regularization strategy aimed at reducing the generalization error of the network. Our regularization scheme perturbs the ground-truth labels in the training set. Two perturbations are discussed: one that randomly changes labels in the training set, flipping salt labels to sediment and vice versa, and a second that smooths the labels. We have determined that such perturbations act as a strong regularizer, preventing the network from making highly confident predictions on the training set and thus reducing overfitting. An ensemble strategy is also used for test-time augmentation and is shown to further improve accuracy. The robustness of our CNN models, in terms of reduced generalization error and improved interpretation accuracy, is demonstrated with real data examples from the Gulf of Mexico.
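
The two label perturbations are simple to state in code. A minimal 1-D sketch, assuming binary salt (1) / sediment (0) labels; the real labels are image masks, and the flip rate and smoothing strength here are illustrative:

```python
import random

def flip_labels(labels, rate, rng=None):
    """Perturbation 1: randomly flip salt/sediment labels at the given rate."""
    rng = rng or random.Random(0)
    return [1 - y if rng.random() < rate else y for y in labels]

def smooth_labels(labels, eps=0.1):
    """Perturbation 2: soften hard 0/1 labels toward eps / 1 - eps, so the
    network is never rewarded for fully confident training predictions."""
    return [y * (1 - eps) + (1 - y) * eps for y in labels]
```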


2016 ◽  
Author(s):  
Philip B. Holden ◽  
H. John B. Birks ◽  
Stephen J. Brooks ◽  
Mark B. Bush ◽  
Grace M. Hwang ◽  
...  

Abstract. We describe the Bayesian User-friendly Model for Palaeo-Environmental Reconstruction (BUMPER), a Bayesian transfer function for inferring past climate from microfossil assemblages. BUMPER is fully self-calibrating, straightforward to apply, and computationally fast, requiring ~ 2 seconds to build a 100-species model from a 100-site training-set on a standard personal computer. We apply the model's probabilistic framework to generate thousands of artificial training-sets under ideal assumptions. We then use these data to demonstrate the sensitivity of reconstructions to the characteristics of the training-set, considering assemblage richness, species tolerances, and the number of training sites. We find that a useful guideline for the size of a training-set is to provide, on average, at least ten samples of each species. We demonstrate general applicability to real data, considering three different organism types (chironomids, diatoms, pollen) and different reconstructed variables. An identically configured model is used in each application, the only change being the input files that provide the training-set environment and species-count data. The performance of BUMPER is shown to be comparable with Weighted Average Partial Least Squares (WAPLS) in each case. Additional artificial datasets are constructed with similar characteristics to the real data, and these are used to explore the reasons for the differing performances of the different training-sets.
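
The ten-samples-per-species guideline can be checked mechanically. A small sketch, assuming the training set is represented as one species-count dictionary per site (a representation chosen here for illustration):

```python
from collections import Counter

def meets_guideline(training_sites, min_avg=10):
    """Rule of thumb from the text: the training set should provide, on
    average, at least `min_avg` samples of each species it contains."""
    totals = Counter()
    for species_counts in training_sites:   # one {species: count} dict per site
        totals.update(species_counts)
    if not totals:
        return False
    return sum(totals.values()) / len(totals) >= min_avg
```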


2020 ◽  
Vol 34 (04) ◽  
pp. 4667-4674 ◽  
Author(s):  
Shikun Li ◽  
Shiming Ge ◽  
Yingying Hua ◽  
Chunhui Zhang ◽  
Hao Wen ◽  
...  

Typically, learning a deep classifier from massive, cleanly annotated instances is effective but impractical in many real-world scenarios. An alternative is to collect and aggregate multiple noisy annotations for each instance to train the classifier. Inspired by this, the paper proposes to learn a deep classifier from multiple noisy annotators via a coupled-view learning approach, in which the learning view from data is represented by deep neural networks for data classification and the learning view from labels is described by a naive Bayes classifier for label aggregation. This coupled-view learning is converted into a supervised learning problem under the mutual supervision of the aggregated and predicted labels, and can be solved via alternating optimization to update the labels and refine the classifiers. To limit the propagation of incorrect labels, a small-loss metric is proposed to select reliable instances in both views. A co-teaching strategy with a class-weighted loss is further leveraged in training the deep classifier: two networks with different learning abilities teach each other, so the diverse errors introduced by noisy labels can be filtered out by the peer networks. With these strategies, the approach learns a robust data classifier that overfits less to label noise. Experimental results on synthetic and real data demonstrate the effectiveness and robustness of the proposed approach.
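
The small-loss selection and the co-teaching exchange can be sketched as follows. The keep ratio and the index-based interface are our assumptions; the actual method couples this with the class-weighted loss and the naive Bayes label aggregation described above.

```python
def small_loss_select(losses, keep_ratio):
    """Treat the keep_ratio fraction of instances with the smallest loss as
    reliably labeled, and return their indices."""
    k = max(1, int(len(losses) * keep_ratio))
    return sorted(range(len(losses)), key=lambda i: losses[i])[:k]

def co_teaching_step(losses_a, losses_b, keep_ratio):
    """Each network selects small-loss instances for its *peer* to train on,
    so errors one network has memorized are filtered by the other.
    Returns (indices for network A to train on, indices for network B)."""
    return small_loss_select(losses_b, keep_ratio), small_loss_select(losses_a, keep_ratio)
```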

