Pronunciation-Enhanced Chinese Word Embedding

Author(s):  
Qinjuan Yang ◽  
Haoran Xie ◽  
Gary Cheng ◽  
Fu Lee Wang ◽  
Yanghui Rao

Abstract Chinese word embeddings have recently garnered considerable attention. Chinese characters and their sub-character components, which contain rich semantic information, have been incorporated to learn Chinese word embeddings. A Chinese character represents a combination of meaning, structure, and pronunciation. However, existing embedding learning methods focus only on the structure and meaning of Chinese characters. In this study, we aim to develop an embedding learning method that makes full use of the information represented by Chinese characters, including phonology, morphology, and semantics. Specifically, we propose a pronunciation-enhanced Chinese word embedding learning method, in which the pronunciations of context characters and target characters are simultaneously encoded into the embeddings. Evaluations on word similarity, word analogy reasoning, text classification, and sentiment analysis validate the effectiveness of the proposed method.
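The core idea of folding pronunciation into the context representation can be pictured with a minimal sketch in the spirit of CBOW/skip-gram composition. The lookup tables, dimensions, and pinyin tokens below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's implementation): combine word, character,
# and pronunciation (pinyin) embeddings into one context representation.
import numpy as np

rng = np.random.default_rng(0)
DIM = 100

# Toy lookup tables: one vector per word, per character, and per pinyin syllable.
word_emb = {"苹果": rng.normal(size=DIM)}
char_emb = {"苹": rng.normal(size=DIM), "果": rng.normal(size=DIM)}
pron_emb = {"ping2": rng.normal(size=DIM), "guo3": rng.normal(size=DIM)}

def compose(word, chars, prons):
    """Average the word vector with its character and pronunciation vectors."""
    parts = [word_emb[word]]
    parts += [char_emb[c] for c in chars]
    parts += [pron_emb[p] for p in prons]
    return np.mean(parts, axis=0)

context_vec = compose("苹果", ["苹", "果"], ["ping2", "guo3"])
print(context_vec.shape)  # (100,)
```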

Author(s):  
Julien Tissier ◽  
Christophe Gravier ◽  
Amaury Habrard

Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performance. However, with a large vocabulary and many dimensions, these floating-point representations are expensive in terms of both memory and computation, which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows the original vectors to be reconstructed from the binary ones. Experimental results on semantic similarity, text classification, and sentiment analysis tasks show that binarizing word embeddings leads to a loss of only ∼2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
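The speed-up comes from replacing cosine similarity over floats with Hamming distance over packed bits. The sketch below uses a random projection instead of the paper's learned autoencoder, and all sizes are placeholders; it only illustrates how 256-bit codes enable fast top-k retrieval with bit operations.

```python
# Illustrative sketch (not the paper's exact model): binarize real-valued
# embeddings via the sign of a linear projection, pack them into bits, and
# run top-k nearest-neighbour search under Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, n_bits = 1000, 300, 256

real_emb = rng.normal(size=(vocab, dim)).astype(np.float32)
W = rng.normal(size=(dim, n_bits)).astype(np.float32)   # stand-in for the learned encoder

codes = (real_emb @ W > 0).astype(np.uint8)              # (vocab, 256) in {0, 1}
packed = np.packbits(codes, axis=1)                      # 32 bytes = 256 bits per word

def top_k(query_idx, k=5):
    """Return the k nearest words to `query_idx` under Hamming distance."""
    xor = np.bitwise_xor(packed, packed[query_idx])
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(dists)[1:k + 1]                    # skip the query itself

print(top_k(0))
```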


Author(s):  
Xiang Lisa Li ◽  
Jason Eisner

Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.
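A rough sketch of the continuous VIB variant is given below, assuming PyTorch. The layer sizes and the beta weight are illustrative placeholders, and the discriminative parser loss is only indicated in a comment; this is not the authors' released implementation.

```python
# Sketch of a continuous variational information bottleneck (VIB) layer that
# compresses a contextual word embedding into a small stochastic code.
import torch
import torch.nn as nn

class VIBCompressor(nn.Module):
    def __init__(self, in_dim=1024, bottleneck=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, bottleneck)       # mean of q(z | embedding)
        self.logvar = nn.Linear(in_dim, bottleneck)   # log-variance of q(z | embedding)

    def forward(self, emb):
        mu, logvar = self.mu(emb), self.logvar(emb)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL divergence from q(z | embedding) to a standard normal prior.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
        return z, kl

vib = VIBCompressor()
emb = torch.randn(32, 1024)           # e.g. a batch of ELMo/BERT vectors
z, kl = vib(emb)
beta = 1e-3                           # bottleneck strength (illustrative)
# total_loss = parser_loss(z, gold_trees) + beta * kl   # parser loss not shown
print(z.shape, kl.item())
```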


2020 ◽  
Author(s):  
Masashi Sugiyama

Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn a separate vector for each sense of a word. Therefore, in this project, we have explored two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which learns the vectors of all senses of a word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG outperforms the other models.
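The mechanism the MSSG family shares is deciding which sense of a word the current context activates. The sketch below is a simplified, assumed version of that sense-selection step (cosine similarity between the averaged context and per-sense cluster centroids); the dimensions, number of senses, and toy data are placeholders.

```python
# Simplified sketch of the sense-selection step in an MSSG-style model: each
# word keeps K sense vectors and K context-cluster centroids, and the sense
# whose centroid is closest to the current context gets updated.
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 50, 3

sense_vecs = rng.normal(size=(K, DIM))       # K sense embeddings for one word
cluster_mu = rng.normal(size=(K, DIM))       # K context-cluster centroids

def select_sense(context_vecs):
    """Pick the sense whose cluster centroid best matches the averaged context."""
    ctx = np.mean(context_vecs, axis=0)
    sims = cluster_mu @ ctx / (np.linalg.norm(cluster_mu, axis=1) * np.linalg.norm(ctx))
    return int(np.argmax(sims))

context = rng.normal(size=(4, DIM))          # embeddings of surrounding words
k = select_sense(context)
# cluster_mu[k] and sense_vecs[k] would then be updated by the skip-gram step
print("selected sense:", k)
```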


2020 ◽  
pp. 1-35
Author(s):  
N. Pittaras ◽  
G. Giannakopoulos ◽  
G. Papadakis ◽  
V. Karkaletsis

Abstract The recent breakthroughs in deep neural architectures across multiple machine learning fields have led to the widespread use of deep neural models. These learners are often applied as black-box models that ignore or insufficiently utilize a wealth of preexisting semantic information. In this study, we focus on the text classification task, investigating methods for augmenting the input to deep neural networks (DNNs) with semantic information. We extract semantics for the words in the preprocessed text from the WordNet semantic graph, in the form of weighted concept terms that form a semantic frequency vector. Concepts are selected via a variety of semantic disambiguation techniques, including a basic, a part-of-speech-based, and a semantic embedding projection method. Additionally, we consider a weight propagation mechanism that exploits semantic relationships in the concept graph and conveys a spreading-activation component. We enrich word2vec embeddings with the resulting semantic vector through concatenation or replacement and apply the semantically augmented word embeddings to the classification task via a DNN. Experimental results over established datasets demonstrate that our approach of semantic augmentation in the input space boosts classification performance significantly, with concatenation offering the best performance. We also report additional findings regarding the behavior of term frequency-inverse document frequency (TF-IDF) normalization on semantic vectors, along with the potential for radical dimensionality reduction with negligible performance loss.
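The concatenation variant can be pictured with a small sketch: build a bag-of-concepts frequency vector from WordNet synsets (using only the basic first-sense disambiguation here) and append it to a word2vec-based representation. It assumes nltk with the WordNet corpus installed; the concept vocabulary and the word2vec vector are toy placeholders, not the paper's pipeline.

```python
# Hedged sketch of the "concatenation" setting: WordNet concept frequencies
# appended to a word2vec document vector.
import numpy as np
from collections import Counter
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def concept_frequency_vector(tokens, concept_index):
    """Count the first-sense synset of each token over a fixed concept vocabulary."""
    counts = Counter()
    for tok in tokens:
        synsets = wn.synsets(tok)
        if synsets:                              # basic (first-sense) disambiguation
            counts[synsets[0].name()] += 1
    vec = np.zeros(len(concept_index))
    for concept, freq in counts.items():
        if concept in concept_index:
            vec[concept_index[concept]] = freq
    return vec

tokens = ["bank", "money", "river"]
concept_index = {"bank.n.01": 0, "money.n.01": 1, "river.n.01": 2}   # toy concept set
semantic_vec = concept_frequency_vector(tokens, concept_index)

doc_word2vec = np.random.default_rng(0).normal(size=300)   # placeholder word2vec vector
augmented = np.concatenate([doc_word2vec, semantic_vec])    # semantically augmented input
print(augmented.shape)
```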


2019 ◽  
Author(s):  
William Jin

Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn a separate vector for each sense of a word. Therefore, in this project, we have explored two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which learns the vectors of all senses of a word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG outperforms the other models.
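To complement the MSSG sense-selection sketch shown earlier, the following illustrates the non-parametric idea behind NP-MSSG (and, applied online, IMSSG): a new sense is created whenever the current context is sufficiently dissimilar from every existing sense cluster. The threshold and toy data are assumptions, not taken from the project.

```python
# Sketch of non-parametric sense creation: spawn a new sense when no existing
# cluster matches the current context well enough.
import numpy as np

rng = np.random.default_rng(1)
DIM, LAMBDA = 50, 0.3                     # LAMBDA: similarity threshold (assumed)

cluster_mu = [rng.normal(size=DIM)]       # start with a single sense cluster
sense_vecs = [rng.normal(size=DIM)]

def assign_or_create(context_vecs):
    """Return the index of the matching sense, creating a new one if needed."""
    ctx = np.mean(context_vecs, axis=0)
    sims = [ctx @ mu / (np.linalg.norm(ctx) * np.linalg.norm(mu)) for mu in cluster_mu]
    if max(sims) < LAMBDA:                # context unlike every known sense
        cluster_mu.append(ctx.copy())
        sense_vecs.append(rng.normal(size=DIM))
        return len(cluster_mu) - 1
    return int(np.argmax(sims))

k = assign_or_create(rng.normal(size=(4, DIM)))
print("assigned sense:", k, "of", len(sense_vecs))
```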


2021 ◽  
pp. 233-252
Author(s):  
Upendar Rao Rayala ◽  
Karthick Seshadri

Sentiment analysis is perceived to be a multi-disciplinary research domain that draws on machine learning, artificial intelligence, deep learning, image processing, and social networks. Sentiment analysis can be used to determine public opinion about products and to capture customers' interests and feedback through social networks. To perform any natural language processing task, the input text/comments must be represented in numerical form. Word embeddings represent the given text/sentences/words as vectors that can be employed in subsequent natural language processing tasks. In this chapter, the authors discuss different techniques that can improve the performance of sentiment analysis using concepts and techniques such as traditional word embeddings, sentiment embeddings, emoticons, lexicons, and neural networks. The chapter also traces the evolution of word embedding techniques with a chronological discussion of recent research advancements in the area.
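As a point of reference for the techniques surveyed in the chapter, the baseline pipeline (embed, pool, classify) can be sketched as follows. The embeddings, comments, and labels are toy placeholders; real systems would use pretrained or sentiment-specific embeddings as discussed in the chapter.

```python
# Minimal illustration: represent each comment as the average of its word
# embeddings and feed that vector to a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 50
emb = {w: rng.normal(size=DIM) for w in ["good", "great", "bad", "awful", "movie"]}

def comment_vector(tokens):
    """Average the embeddings of known tokens (zero vector if none are known)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

X = np.stack([comment_vector(c.split()) for c in
              ["good movie", "great movie", "bad movie", "awful movie"]])
y = np.array([1, 1, 0, 0])                       # 1 = positive, 0 = negative

clf = LogisticRegression().fit(X, y)
print(clf.predict([comment_vector("great good movie".split())]))  # expect [1]
```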


2015 ◽  
Vol 37 (3) ◽  
pp. 507-527 ◽  
Author(s):  
JIE ZHANG ◽  
HONG LI ◽  
QIONG DONG ◽  
JIE XU ◽  
ELIZABETH SHOLAR

Abstract This study investigated whether beginning nonnative learners of Chinese can use the phonological and semantic information of radicals to learn the sounds and meanings of new Chinese characters. Thirty-four seventh- and eighth-grade American adolescents, who had received intensive Chinese instruction for one semester, were taught 16 compound pseudocharacters paired with novel pictures over three learning trials. After each learning trial, students were asked to produce the sounds and meanings of the pseudocharacters, in which the semantic transparency and phonetic regularity of the radicals were manipulated. Results showed a facilitation effect of transparent semantic radicals on learning character meanings in early trials. There was a trend for students to learn to read regular and transparent characters better than irregular and opaque ones. The ability to learn orthography-pronunciation associations uniquely predicted Chinese word reading after controlling for semantic and phonetic radical knowledge. These findings suggest a predominant use of semantic strategies and the importance of orthography-to-phonology mappings in learning to read Chinese for beginning nonnative learners.

