Replacing Out-of-Vocabulary Words with an Appropriate Synonym Based on Word2VnCR

2021 · Vol 2021 · pp. 1-7
Author(s): Jeongin Kim, Taekeun Hong, Pankoo Kim

One of the most common problems in natural language analysis is finding synonyms of out-of-vocabulary (OOV) words. When a person tries to understand a sentence containing an OOV word, they determine the most appropriate meaning of a replacement word from the meanings of the co-occurring words in the same context, drawing on the conceptual system they have learned. In this study, a word-to-vector and conceptual relationship (Word2VnCR) algorithm is proposed that replaces an OOV word, which would otherwise lead to an erroneous morphemic analysis, with an appropriate synonym. The Word2VnCR algorithm improves on the conventional Word2Vec algorithm, which struggles to suggest a replacement word because it does not assess the similarity of candidate words. After word-embedding learning is conducted on the training dataset, replacement candidates for the OOV word are extracted. The semantic similarity of each extracted candidate is measured against the words neighboring the OOV word, and the candidate with the highest similarity value is selected as the replacement. To evaluate the performance of the proposed Word2VnCR algorithm, a comparative experiment was conducted against the Word2Vec algorithm. The experimental results indicate that the proposed algorithm achieves higher accuracy than Word2Vec.
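The selection step described above lends itself to a short illustration. The sketch below is not the authors' implementation: it simply ranks candidate replacements by their average cosine similarity to the words surrounding the OOV token, with the gensim KeyedVectors model, the candidate list, and the context window all assumed inputs.

```python
# Hedged sketch of the candidate-scoring step: rank each replacement candidate
# by its average cosine similarity to the OOV word's neighboring words.
from gensim.models import KeyedVectors

def best_replacement(candidates, context_words, kv: KeyedVectors):
    """Return the candidate most similar, on average, to the context words."""
    scored = []
    for cand in candidates:
        if cand not in kv:
            continue  # skip candidates missing from the embedding vocabulary
        sims = [kv.similarity(cand, w) for w in context_words if w in kv]
        if sims:
            scored.append((sum(sims) / len(sims), cand))
    return max(scored)[1] if scored else None

# Hypothetical usage: pick a stand-in for an unknown token given its neighbors.
# best_replacement(["automobile", "vehicle"], ["road", "driver", "engine"], kv)
```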

Author(s): Bin Wang, Angela Wang, Fenxiao Chen, Yuncheng Wang, C.-C. Jay Kuo

An extensive evaluation of a large number of word embedding models for language processing applications is conducted in this work. First, we introduce popular word embedding models and discuss the desired properties of word models and evaluation methods (or evaluators). Then, we categorize evaluators into two types: intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independently of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. We report experimental results of intrinsic and extrinsic evaluators on six word embedding models. It is shown that different evaluators focus on different aspects of word models, and some are more correlated with natural language processing tasks than others. Finally, we adopt correlation analysis to study the performance consistency of extrinsic and intrinsic evaluators.
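As an illustration of what an intrinsic evaluator looks like in practice, the sketch below computes the standard word-similarity score: the Spearman correlation between human ratings and embedding cosine similarities. The pair format and the gensim KeyedVectors model are assumptions, not the paper's exact protocol.

```python
# Minimal word-similarity evaluator: correlate human ratings with cosine
# similarities produced by the embedding model.
from scipy.stats import spearmanr
from gensim.models import KeyedVectors

def word_similarity_score(pairs, kv: KeyedVectors):
    """pairs: iterable of (word1, word2, human_rating)."""
    human, model = [], []
    for w1, w2, rating in pairs:
        if w1 in kv and w2 in kv:  # only score pairs covered by the vocabulary
            human.append(rating)
            model.append(kv.similarity(w1, w2))
    rho, _ = spearmanr(human, model)
    return rho  # higher correlation = better intrinsic quality
```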


2020 · Vol 34 (10) · pp. 13969-13970
Author(s): Atsuki Yamaguchi, Katsuhide Fujita

In human-human negotiation, reaching a rational agreement can be difficult, and negotiations sometimes break down because of conflicts of interest. If artificial intelligence can play a role in assisting human-human negotiation, it can help avoid negotiation breakdowns and lead to a rational agreement. Therefore, this study focuses on end-to-end tasks for predicting the outcome of a negotiation dialogue in natural language. Our task is modeled using a gated recurrent unit (GRU) and a pre-trained language model (BERT) as baselines. Experimental results demonstrate that the proposed tasks are feasible on two negotiation dialogue datasets and that the baselines can detect signs of a breakdown at an early stage, even when given only a partial dialogue history.
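A minimal sketch of a GRU baseline for this kind of outcome prediction is given below; it is not the authors' implementation, and the vocabulary size, hidden size, and number of outcome classes are placeholders.

```python
# Toy GRU classifier: encode a (possibly partial) dialogue as token ids and
# predict the negotiation outcome from the final hidden state.
import torch
import torch.nn as nn

class DialogueOutcomeGRU(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)     # (batch, seq_len, embed_dim)
        _, last_hidden = self.gru(embedded)  # last_hidden: (1, batch, hidden_dim)
        return self.classifier(last_hidden.squeeze(0))  # (batch, num_classes)

# logits = DialogueOutcomeGRU()(torch.randint(1, 10000, (4, 120)))
```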


Author(s): Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, ...

Recent work has shown that integrating visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general-purpose corpus that aligns high-quality videos with multiple natural language descriptions of the actions portrayed in the videos, together with an annotation of how similar the action descriptions are to each other. Experimental results demonstrate that a text-based model of similarity between actions improves substantially when combined with visual information from videos depicting the described actions.
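One simple way to combine the two signals, shown below purely as an illustration rather than the authors' model, is late fusion: blend a text-based similarity with a video-based similarity using a mixing weight. The feature extractors and the weight alpha are assumptions.

```python
# Late-fusion sketch: mix text-based and video-based similarity scores for a
# pair of action descriptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def action_similarity(text_vec1, text_vec2, video_vec1, video_vec2, alpha=0.5):
    """Blend text-based and video-based similarity for two action descriptions."""
    text_sim = cosine(text_vec1, text_vec2)
    visual_sim = cosine(video_vec1, video_vec2)
    return alpha * text_sim + (1 - alpha) * visual_sim
```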


Complexity · 2020 · Vol 2020 · pp. 1-14
Author(s): Leilei Kong, Zhongyuan Han, Yong Han, Haoliang Qi

Paraphrase identification is central to many natural language applications. Based on the insight that a successful paraphrase identification model needs to adequately capture the semantics of the language objects as well as their interactions, we present a deep paraphrase identification model interacting semantics with syntax (DPIM-ISS). DPIM-ISS introduces linguistic features manifested as syntactic features to produce more explicit structures, and it encodes the semantic representation of a sentence over different syntactic structures by interacting semantics with syntax. DPIM-ISS then learns paraphrase patterns from this syntax-aware semantic representation using a convolutional neural network with a convolution-pooling structure. Experiments are conducted on the Microsoft Research Paraphrase (MSRP) corpus and on the PAN 2010 and PAN 2012 corpora for paraphrase plagiarism detection. The experimental results demonstrate that DPIM-ISS outperforms classical word-matching approaches, syntax-similarity approaches, convolutional neural network-based models, and several deep paraphrase identification models.
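The interaction-then-CNN idea can be illustrated with the simplified sketch below, which omits the syntax-aware encoding that DPIM-ISS adds: it builds a word-level interaction matrix between the two sentences and classifies it with a convolution-pooling network. All sizes are placeholders.

```python
# Simplified interaction + convolution-pooling classifier for paraphrase
# identification (not the DPIM-ISS architecture itself).
import torch
import torch.nn as nn

class InteractionCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool2d((8, 8))
        self.classifier = nn.Linear(16 * 8 * 8, 2)  # paraphrase / not paraphrase

    def forward(self, sent1, sent2):  # (batch, len, dim) token embeddings
        # Cosine-style interaction matrix between every pair of tokens.
        sent1 = nn.functional.normalize(sent1, dim=-1)
        sent2 = nn.functional.normalize(sent2, dim=-1)
        interaction = torch.bmm(sent1, sent2.transpose(1, 2)).unsqueeze(1)
        features = self.pool(torch.relu(self.conv(interaction)))
        return self.classifier(features.flatten(1))
```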


Author(s): Son Doan, Susumu Horiguchi

Text categorization involves assigning a natural language document to one or more predefined classes. One of the most interesting issues is feature selection. We propose an approach based on multicriteria ranking of features, a new procedure for feature selection, and apply it to text categorization. Experimental results on the Reuters-21578 and 20Newsgroups benchmark datasets with the naive Bayes algorithm show that our proposal outperforms conventional feature selection in text categorization performance.
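The following sketch illustrates one way a multicriteria ranking of features could be realized before naive Bayes training; averaging the ranks produced by chi-square and mutual information is only an illustration, not the authors' exact procedure.

```python
# Combine two feature-ranking criteria by averaging their ranks, then keep the
# top-k features for a naive Bayes text categorizer.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

def select_features(X, y, k=1000):
    """X: document-term matrix, y: labels. Return indices of the top-k features."""
    rank_chi2 = np.argsort(np.argsort(-chi2(X, y)[0]))        # rank 0 = best chi-square
    rank_mi = np.argsort(np.argsort(-mutual_info_classif(X, y)))
    combined_rank = (rank_chi2 + rank_mi) / 2.0               # lower = better under both criteria
    return np.argsort(combined_rank)[:k]

# idx = select_features(X_train, y_train)
# clf = MultinomialNB().fit(X_train[:, idx], y_train)
```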


2004 · Vol 13 (04) · pp. 813-828
Author(s): JING XIAO, TAT-SENG CHUA, JIMIN LIU

The ability to extract desired pieces of information from natural language texts is an important task with a growing number of potential applications. This paper presents a novel pattern-rule induction learning system, GRID, which emphasizes the use of the global feature distribution across all training instances in order to make better decisions during rule induction. GRID uses chunks as contextual units instead of tokens and incorporates features at the lexical, syntactic, and semantic levels simultaneously. The features chosen in GRID are general, and they were applied successfully to both semi-structured text and free text. Our experimental results on publicly available webpage corpora and the MUC-4 test set indicate that our approach is effective.
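As a toy illustration of chunk-level extraction rules in the spirit of GRID (not the system's actual data structures), the sketch below represents a rule as a sequence of constraints over chunks and returns the chunk marked as the slot filler when the pattern matches.

```python
# Toy chunk-level extraction rule: each position constrains the chunk type and
# (optionally) the semantic class; one position is marked as the slot filler.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    chunk_type: str      # e.g. "NP", "VP"
    semantic_class: str  # e.g. "PERSON", "ORG", "OTHER"

def match_rule(rule, chunks):
    """rule: list of (chunk_type, semantic_class or None, is_slot) triples."""
    for start in range(len(chunks) - len(rule) + 1):
        window = chunks[start:start + len(rule)]
        if all(c.chunk_type == t and (s is None or c.semantic_class == s)
               for c, (t, s, _) in zip(window, rule)):
            return next((c.text for c, (_, _, slot) in zip(window, rule) if slot), None)
    return None

# rule = [("NP", "ORG", True), ("VP", None, False), ("NP", "PERSON", False)]
# match_rule(rule, chunks)  # -> text of the matching ORG chunk, or None
```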


2020
Author(s): Masashi Sugiyama

Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn separate vectors for each sense of a word. Therefore, in this project, we explore two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-Parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, the Incremental Multi-Sense Skip-gram (IMSSG) model, which can learn the vectors of all senses of a word incrementally. We evaluate all the systems on the word similarity task and show that IMSSG outperforms the other models.
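A minimal sketch of the sense-assignment step used in MSSG-style training is shown below: the current context is averaged and compared against each sense's context-cluster center, and the closest sense is the one updated for this occurrence. The array shapes and the cosine scoring are standard choices but still assumptions here.

```python
# Pick which sense of a word the current occurrence belongs to by comparing
# the averaged context vector against each sense's cluster center.
import numpy as np

def assign_sense(context_vecs, sense_cluster_centers):
    """context_vecs: (n_context, dim); sense_cluster_centers: (n_senses, dim)."""
    context_mean = context_vecs.mean(axis=0)
    sims = sense_cluster_centers @ context_mean / (
        np.linalg.norm(sense_cluster_centers, axis=1) * np.linalg.norm(context_mean) + 1e-9
    )
    return int(np.argmax(sims))  # index of the sense to train for this occurrence
```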


2021 · Vol 1 (2) · pp. 18-22
Author(s): Strahil Sokolov, Stanislava Georgieva

This paper presents a new approach to the processing and categorization of text from patient documents in the Bulgarian language using Natural Language Processing and Edge AI. The proposed algorithm comprises several phases: personal data anonymization, pre-processing and conversion of text to vectors, model training, and recognition. The accuracy achieved in the experiments is comparable with that of modern approaches.
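An illustrative sketch of the listed phases follows; the regex patterns, the TF-IDF vectorizer, and the classifier are placeholders standing in for the paper's actual components.

```python
# Toy pipeline: rule-based anonymization, text-to-vector conversion, and
# classifier training for document categorization.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def anonymize(text):
    text = re.sub(r"\b\d{10}\b", "<ID>", text)                     # national-ID-like numbers
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "<EMAIL>", text)  # email addresses
    return text

def train_categorizer(documents, labels):
    cleaned = [anonymize(d) for d in documents]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return model.fit(cleaned, labels)

# categorizer = train_categorizer(patient_docs, doc_categories)
# predicted = categorizer.predict([anonymize(new_document)])
```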


Author(s): Zhiguo Wang, Wael Hamza, Radu Florian

Natural language sentence matching is a fundamental technology for a variety of tasks. Previous approaches either match sentences from a single direction or apply only single-granularity (word-by-word or sentence-by-sentence) matching. In this work, we propose a bilateral multi-perspective matching (BiMPM) model. Given two sentences P and Q, our model first encodes them with a BiLSTM encoder. Next, we match the two encoded sentences in two directions: P against Q and Q against P. In each matching direction, each time step of one sentence is matched against all time steps of the other sentence from multiple perspectives. Another BiLSTM layer is then used to aggregate the matching results into a fixed-length matching vector. Finally, based on the matching vector, a decision is made through a fully connected layer. We evaluate our model on three tasks: paraphrase identification, natural language inference, and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves state-of-the-art performance on all tasks.
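The core multi-perspective cosine matching operation can be sketched as follows: each perspective re-weights the two vectors element-wise before taking a cosine similarity, so one vector pair yields one score per perspective. The number of perspectives and the dimensions are placeholders.

```python
# Multi-perspective cosine matching: one similarity score per learnable
# perspective for a pair of hidden-state vectors.
import torch
import torch.nn.functional as F

def multi_perspective_match(v1, v2, weights):
    """v1, v2: (dim,) vectors; weights: (num_perspectives, dim) learnable matrix."""
    weighted_v1 = weights * v1  # (num_perspectives, dim), element-wise per perspective
    weighted_v2 = weights * v2
    return F.cosine_similarity(weighted_v1, weighted_v2, dim=1)  # (num_perspectives,)

# weights = torch.nn.Parameter(torch.randn(20, 100))
# scores = multi_perspective_match(torch.randn(100), torch.randn(100), weights)
```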

