Causally Denoise Word Embeddings Using Half-Sibling Regression

2020 ◽  
Vol 34 (05) ◽  
pp. 9426-9433 ◽  
Author(s):  
Zekun Yang ◽  
Tianlin Liu

Distributional representations of words, also known as word vectors, have become crucial for modern natural language processing tasks due to their wide applications. Recently, a growing body of word vector postprocessing algorithms has emerged, aiming to render off-the-shelf word vectors even stronger. In line with these investigations, we introduce a novel word vector postprocessing scheme under a causal inference framework. Concretely, the postprocessing pipeline is realized by Half-Sibling Regression (HSR), which allows us to identify and remove confounding noise contained in word vectors. Compared to previous work, our proposed method has the advantages of interpretability and transparency due to its causal inference grounding. Evaluated on a battery of standard lexical-level evaluation tasks and downstream sentiment analysis tasks, our method reaches state-of-the-art performance.
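The general idea behind Half-Sibling Regression can be sketched in one dimension: if a target y is corrupted by a noise source that also drives a second observation x (its "half-sibling"), then subtracting an estimate of E[y | x] from y removes the part of y that the shared noise explains. The sketch below is a toy illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the regression is a plain least-squares line, and the names fit_line and half_sibling_denoise are hypothetical. How the paper picks the predictor and target word vectors is not stated in the abstract.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def half_sibling_denoise(x, y):
    """Return y - E[y | x], with E[y | x] estimated by a linear fit."""
    slope, intercept = fit_line(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Synthetic data (assumption): a signal of interest plus a shared confounder.
rng = random.Random(0)
n = 2000
signal = [rng.gauss(0, 1) for _ in range(n)]      # quantity we want to recover
confounder = [rng.gauss(0, 1) for _ in range(n)]  # noise shared by x and y
y = [q + 2.0 * c for q, c in zip(signal, confounder)]   # target, corrupted
x = [c + 0.1 * rng.gauss(0, 1) for c in confounder]     # half-sibling, sees only the noise

denoised = half_sibling_denoise(x, y)
error_before = mse(y, signal)
error_after = mse(denoised, signal)
```

Because x carries no information about the signal, regressing y on x can only explain the confounder, so error_after should be much smaller than error_before in this toy setting.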

2017 ◽  
Vol 24 (4) ◽  
pp. 813-821 ◽  
Author(s):  
Anne Cocos ◽  
Alexander G Fiks ◽  
Aaron J Masino

Objective: Social media is an important pharmacovigilance data source for adverse drug reaction (ADR) identification. Human review of social media data is infeasible due to data quantity, thus natural language processing techniques are necessary. Social media includes informal vocabulary and irregular grammar, which challenge natural language processing methods. Our objective is to develop a scalable, deep-learning approach that exceeds state-of-the-art ADR detection performance in social media. Materials and Methods: We developed a recurrent neural network (RNN) model that labels words in an input sequence with ADR membership tags. The only input features are word-embedding vectors, which can be formed through task-independent pretraining or during ADR detection training. Results: Our best-performing RNN model used pretrained word embeddings created from a large, non–domain-specific Twitter dataset. It achieved an approximate match F-measure of 0.755 for ADR identification on the dataset, compared to 0.631 for a baseline lexicon system and 0.65 for the state-of-the-art conditional random field model. Feature analysis indicated that semantic information in pretrained word embeddings boosted sensitivity and, combined with contextual awareness captured in the RNN, precision. Discussion: Our model required no task-specific feature engineering, suggesting generalizability to additional sequence-labeling tasks. Learning curve analysis showed that our model reached optimal performance with fewer training examples than the other models. Conclusions: ADR detection performance in social media is significantly improved by using a contextually aware model and word embeddings formed from large, unlabeled datasets. The approach reduces manual data-labeling requirements and is scalable to large social media datasets.


2021 ◽  
pp. 233-252
Author(s):  
Upendar Rao Rayala ◽  
Karthick Seshadri

Sentiment analysis is perceived to be a multi-disciplinary research domain composed of machine learning, artificial intelligence, deep learning, image processing, and social networks. Sentiment analysis can be used to determine opinions of the public about products and to find the customers' interest and their feedback through social networks. To perform any natural language processing task, the input text/comments should be represented in a numerical form. Word embeddings represent the given text/sentences/words as a vector that can be employed in performing subsequent natural language processing tasks. In this chapter, the authors discuss different techniques that can improve the performance of sentiment analysis using concepts and techniques like traditional word embeddings, sentiment embeddings, emoticons, lexicons, and neural networks. This chapter also traces the evolution of word embedding techniques with a chronological discussion of the recent research advancements in word embedding techniques.


Author(s):  
Jing Li

Identifying discourse structures and coherence relations in a piece of text is a fundamental task in natural language processing. The first step of this process is segmenting sentences into clause-like units called elementary discourse units (EDUs). Traditional solutions to discourse segmentation heavily rely on carefully designed features. In this demonstration, we present SegBot, a system to split a given piece of text into sequence of EDUs by using an end-to-end neural segmentation model. Our model does not require hand-crafted features or external knowledge except word embeddings, yet it outperforms state-of-the-art solutions to discourse segmentation.


2021 ◽  
Vol 12 ◽  
Author(s):  
Emma Pair ◽  
Nikitha Vicas ◽  
Ann M. Weber ◽  
Valerie Meausoone ◽  
James Zou ◽  
...  

Background: Despite a 2010 Kenyan constitutional amendment limiting members of elected public bodies to < two-thirds of the same gender, only 22 percent of the 12th Parliament members inaugurated in 2017 were women. Investigating gender bias in the media is a useful tool for understanding socio-cultural barriers to implementing legislation for gender equality. Natural language processing (NLP) methods, such as word embedding and sentiment analysis, can efficiently quantify media biases at a scope previously unavailable in the social sciences. Methods: We trained GloVe and word2vec word embeddings on text from 1998 to 2019 from Kenya’s Daily Nation newspaper. We measured gender bias in these embeddings and used sentiment analysis to predict quantitative sentiment scores for sentences surrounding female leader names compared to male leader names. Results: Bias in leadership words for men and women measured from Daily Nation word embeddings corresponded to temporal trends in men and women’s participation in political leadership (i.e., parliamentary seats) using GloVe (correlation 0.8936, p = 0.0067, r² = 0.799) and word2vec (correlation 0.844, p = 0.0169, r² = 0.712) algorithms. Women continue to be associated with domestic terms while men continue to be associated with influence terms, for both regular gender words and female and male political leaders’ names. Male words (e.g., he, him, man) were mentioned 1.84 million more times than female words from 1998 to 2019. Sentiment analysis showed an increase in relative negative sentiment associated with female leaders (p = 0.0152) and an increase in positive sentiment associated with male leaders over time (p = 0.0216). Conclusion: Natural language processing is a powerful method for gaining insights into and quantifying trends in gender biases and sentiment in news media. We found evidence of improvement in gender equality but also a backlash from increased female representation in high-level governmental leadership.
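One common way to quantify gender association in word embeddings is to compare how close a target word lies to a set of male words versus a set of female words. The abstract does not say which bias measure the authors used, so the sketch below is only an assumed, generic variant: the function association_gap is hypothetical, and the three-dimensional vectors are made up for illustration rather than taken from any trained GloVe or word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(word, group, vectors):
    """Average cosine similarity between one word and a group of words."""
    return sum(cosine(vectors[word], vectors[g]) for g in group) / len(group)


def association_gap(targets, male_words, female_words, vectors):
    """Positive values mean the targets sit closer to the male word set."""
    gaps = [
        mean_similarity(t, male_words, vectors)
        - mean_similarity(t, female_words, vectors)
        for t in targets
    ]
    return sum(gaps) / len(gaps)


# Made-up 3-d vectors, for illustration only (not real embeddings).
toy_vectors = {
    "he": [1.0, 0.1, 0.0],
    "him": [0.9, 0.2, 0.1],
    "she": [0.1, 1.0, 0.0],
    "her": [0.2, 0.9, 0.1],
    "leader": [0.8, 0.3, 0.2],
    "home": [0.2, 0.8, 0.3],
}

male = ["he", "him"]
female = ["she", "her"]
gap_leader = association_gap(["leader"], male, female, toy_vectors)
gap_home = association_gap(["home"], male, female, toy_vectors)
```

Computed once per year of newspaper text, a score of this kind could then be correlated with an external time series such as the share of parliamentary seats, which is the style of comparison the Results describe.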


2020 ◽  
Author(s):  
Vadim V. Korolev ◽  
Artem Mitrofanov ◽  
Kirill Karpov ◽  
Valery Tkachenko

The main advantage of modern natural language processing methods is the ability to turn an amorphous human-readable task into a strict mathematical form. This makes it possible to extract chemical data and insights from articles and to find new semantic relations. We propose a universal engine for processing chemical and biological texts. We successfully tested it on various use cases and applied it to the search for a therapeutic agent for COVID-19 by analyzing the PubMed archive.


2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
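Representing a document as a linear weighting of pre-trained word vectors can be sketched as a weighted average. The abstract does not name the two weighting methods, so the uniform and per-word weights below are assumptions chosen for illustration, and document_vector together with the two-dimensional toy vectors are hypothetical.

```python
def document_vector(tokens, word_vectors, weights=None):
    """Weighted average of the vectors of the tokens found in word_vectors.

    With weights=None every known token counts equally; otherwise each token
    uses weights[token], and tokens missing from weights get weight 0.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        for i, value in enumerate(word_vectors[tok]):
            total[i] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no usable tokens: return the zero vector
    return [t / weight_sum for t in total]


# Made-up 2-d vectors, for illustration only.
toy_word_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

uniform_doc = document_vector(["good", "movie"], toy_word_vectors)
weighted_doc = document_vector(
    ["good", "movie"], toy_word_vectors, weights={"good": 3.0, "movie": 1.0}
)
```

In a pipeline like the one described, vectors of this kind would be the input to the neural sentiment classifier; the step that propagates polarity back into the word vectors is not shown here.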


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. In this sense, we present our work related to the application of Natural Language Processing techniques, as a tool to analyze the sentiment perception of users who answered two questions from the CSQ-8 questionnaires with raw Spanish free-text. Their responses are related to mindfulness, which is a novel technique used to control stress and anxiety caused by different factors in daily life. As such, we proposed an online course where this method was applied in order to improve the quality of life of health care professionals in COVID-19 pandemic times. We also carried out an evaluation of the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To automatically perform this task, we used Natural Language Processing (NLP) models such as swivel embedding, neural networks, and transfer learning, so as to classify the inputs into the following three categories: negative, neutral, and positive. Due to the limited amount of data available (86 registers for the first question and 68 for the second), transfer learning techniques were required. The length of the text had no limit from the user’s standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis, using computer graphic text representation based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that applying NLP techniques to small amounts of data using transfer learning can achieve sufficient accuracy in the sentiment analysis and text classification stages.


Author(s):  
Warnia Nengsih ◽  
M. Mahrus Zein ◽  
Nazifa Hayati

Sentiment analysis is a method for obtaining data from the various platforms available on the internet. Advances in technology allow machines to recognize whether a term expresses a positive opinion or the opposite. Such data and opinions play an important role as feedback on products, services, and other topics. Without having to collect opinions directly from the public, providers obtain evaluations that are important for their own improvement. The hotel business is a field concerned with providing services to customers. The sustainability of this business also depends on customer feedback, which serves as a reference for strategic policy decisions. Sentiment analysis techniques based on Natural Language Processing can address this problem. In this paper, prediction is performed using a Random Forest (RF) classifier, while the Receiver Operating Characteristic (ROC) curve, a graph well suited to the purpose, is used to summarize classifier quality. The higher the curve lies above the diagonal line, the better the prediction; the ROC curve value obtained was 0.90. The reviews show that customer opinions on the services provided by the hotel fall into the positive category more often than the negative category. In terms of polarity, 68% of customer reviews fall in the positive area and 32% in the negative area.
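A single number summarizing a ROC curve is usually the area under it (AUC), which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The abstract does not say how its value of 0.90 was computed, so the sketch below only illustrates that pairwise definition; the function roc_auc and the four example scores are hypothetical and unrelated to the hotel review data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive).
    scores: classifier scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to the diagonal line (random guessing) and 1.0 to a perfect ranking, which matches the reading that a curve further above the diagonal indicates better prediction.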

