Lifelong Learning of Topics and Domain-Specific Word Embeddings

Author(s):  
Xiaorui Qin ◽  
Yuyin Lu ◽  
Yufu Chen ◽  
Yanghui Rao

Lifelong Domain Word Embedding via Meta-Learning

Author(s):  
Hu Xu ◽  
Bing Liu ◽  
Lei Shu ◽  
Philip S. Yu

Learning high-quality domain word embeddings is important for achieving good performance in many NLP tasks. General-purpose embeddings trained on large-scale corpora are often sub-optimal for domain-specific applications, yet domain-specific tasks often lack large in-domain corpora for training high-quality domain embeddings. In this paper, we propose a novel lifelong learning setting for domain embedding: when training an embedding for a new domain, the system has already seen many past domains, and it tries to expand the new in-domain corpus by exploiting the corpora of those past domains via meta-learning. The proposed meta-learner characterizes the similarities of the contexts of the same word across many domain corpora, which helps retrieve relevant data from the past domains to expand the new domain corpus. Experimental results show that domain embeddings produced by this process improve the performance of downstream tasks.
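The retrieval step described in this abstract can be approximated in a few lines. The sketch below is not the authors' implementation: the paper trains a meta-learner to score context similarity, whereas here a plain cosine similarity over averaged context vectors stands in for it, and the embedding lookup `emb` (word → vector) is an assumed input.

```python
# Minimal sketch of the past-domain retrieval idea (not the authors' code).
# Assumption: each corpus is a list of tokenized sentences, and a word's
# "context feature" is the average embedding of its neighbouring words.
import numpy as np

def context_vector(sentence, idx, emb, window=5):
    """Average embedding of the words surrounding position idx."""
    lo, hi = max(0, idx - window), min(len(sentence), idx + window + 1)
    ctx = [emb[w] for i, w in enumerate(sentence[lo:hi], start=lo)
           if i != idx and w in emb]
    return np.mean(ctx, axis=0) if ctx else None

def relevant_past_sentences(new_corpus, past_corpus, emb, threshold=0.7):
    """Keep past-domain sentences whose word contexts resemble the new domain's."""
    # Profile each word's typical context in the new domain.
    profiles = {}
    for sent in new_corpus:
        for i, w in enumerate(sent):
            v = context_vector(sent, i, emb)
            if v is not None:
                profiles.setdefault(w, []).append(v)
    profiles = {w: np.mean(vs, axis=0) for w, vs in profiles.items()}

    selected = []
    for sent in past_corpus:
        sims = []
        for i, w in enumerate(sent):
            if w in profiles:
                v = context_vector(sent, i, emb)
                if v is not None:
                    p = profiles[w]
                    sims.append(v @ p / (np.linalg.norm(v) * np.linalg.norm(p) + 1e-9))
        if sims and np.mean(sims) >= threshold:
            selected.append(sent)  # candidate for expanding the new in-domain corpus
    return selected
```

Sentences that survive the similarity threshold would be appended to the new in-domain corpus before the final embedding training.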


IEEE Access ◽  
2021 ◽  
Vol 9 ◽  
pp. 137309-137321
Author(s):  
Luca Cagliero ◽  
Moreno La Quatra

2019 ◽  
Author(s):  
José Padarian ◽  
Ignacio Fuentes

Abstract. A large amount of descriptive information is available in most disciplines of geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descriptive information and encode it as dense vectors. These word embeddings lie in a multi-dimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations, namely: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. Since this is the first attempt to evaluate word embeddings for tasks in the geosciences domain, we created a test suite specific to geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9 %. The resulting embeddings and test suite will be made available for other researchers to use and expand.
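As an illustration of the analogy-style intrinsic evaluation described above, the sketch below uses gensim's Word2Vec. The toy corpus and the example analogy are placeholders, not the GeoVec training data or the released test suite.

```python
# Toy sketch of training embeddings and answering an analogy question with gensim.
from gensim.models import Word2Vec

sentences = [["granite", "is", "an", "intrusive", "igneous", "rock"],
             ["basalt", "is", "an", "extrusive", "igneous", "rock"]]  # placeholder corpus
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)

# Analogy: "granite is to intrusive as basalt is to ?" -> ideally "extrusive"
# (a real evaluation would iterate over a full analogy test set and score accuracy)
result = model.wv.most_similar(positive=["intrusive", "basalt"],
                               negative=["granite"], topn=1)
print(result)
```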


Symmetry ◽  
2020 ◽  
Vol 12 (1) ◽  
pp. 89 ◽  
Author(s):  
Hsiang-Yuan Yeh ◽  
Yu-Ching Yeh ◽  
Da-Bai Shen

Linking textual information in finance reports to stock return volatility offers a perspective for exploring useful insights for risk management. We introduce different kinds of word vector representations in the modeling of textual information: bag-of-words, pre-trained word embeddings, and domain-specific word embeddings. We apply linear and non-linear methods to establish a text regression model for volatility prediction. A large collection of annually published financial reports from the period 1996 to 2013 is used in the experiments. We demonstrate that the domain-specific word vectors learned from the data not only capture lexical semantics, but also perform better than pre-trained word embeddings and the traditional bag-of-words model. Our approach significantly outperforms state-of-the-art methods, with smaller prediction error in the regression task and a 4%–10% improvement in the ranking task. These improvements suggest that textual information may have a measurable effect on long-term volatility forecasting. In addition, we find that variations and regulatory changes in reports make older reports less relevant for volatility prediction. Our approach opens a new avenue of research in information economics and can be applied to a wide range of finance-related applications.
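A minimal sketch of the text-regression setup follows, under the assumption that each report is encoded as the average of its word vectors and the target is log volatility. The lookup `emb` (word → vector) is an assumed input, and Ridge regression stands in for the paper's linear models.

```python
# Sketch: averaged word vectors as document features for volatility regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def doc_vector(tokens, emb, dim=300):
    """Average the embeddings of in-vocabulary tokens; zero vector if none."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def fit_volatility_model(train_docs, train_vol, test_docs, test_vol, emb):
    X_train = np.stack([doc_vector(d, emb) for d in train_docs])
    X_test = np.stack([doc_vector(d, emb) for d in test_docs])
    reg = Ridge(alpha=1.0).fit(X_train, np.log(train_vol))  # log-volatility target
    mse = mean_squared_error(np.log(test_vol), reg.predict(X_test))
    return reg, mse
```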


2020 ◽  
Author(s):  
Derek Koehl ◽  
Carson Davis ◽  
Rahul Ramachandran ◽  
Udaysankar Nair ◽  
Manil Maskey

Word embeddings are numeric representations of text which capture meanings and semantic relationships. Embeddings can be constructed using different methods, such as one-hot encoding, frequency-based approaches, or prediction-based approaches. Prediction-based approaches such as Word2Vec can generate word embeddings that capture the underlying semantics and word relationships in a corpus. Word2Vec embeddings generated from a domain-specific corpus have been shown to both predict relationships and augment word vectors to improve classification. We describe results from two experiments utilizing word embeddings for Earth science, constructed from a corpus of over 20,000 journal papers using Word2Vec.

The first experiment explores the analogy-prediction performance of word embeddings built from the Earth science journal corpus and trained on domain-specific vocabulary. Our results demonstrate that the accuracy of domain-specific word embeddings in predicting Earth science analogy questions exceeds the ability of general corpus embeddings to predict general analogy questions. While the results were as anticipated, the substantial increase in accuracy, particularly in the lexicographical domain, was encouraging. The results point to the need for a comprehensive Earth science analogy test set that covers the full breadth of lexicographical and encyclopedic categories for validating word embeddings.

The second experiment utilizes the word embeddings to augment metadata keyword classification. Metadata describing NASA datasets include manually assigned science keywords, which can lead to errors and inconsistencies. These science keywords come from a controlled vocabulary and are used to aid data discovery via faceted search and relevancy ranking. Given the small number of metadata records with proper descriptions and keywords, word embeddings were used for augmentation. A fully connected neural network was trained to suggest keywords given a description text. This approach provided the best accuracy, at ~76%, compared to the other methods tested.
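A hedged sketch of the second experiment's classifier: descriptions are embedded as averaged Word2Vec vectors and a small fully connected network predicts science keywords as a multi-label problem. All names and shapes here are illustrative assumptions rather than the authors' code, and the ~76% accuracy is their reported figure, not something this sketch reproduces.

```python
# Sketch: multi-label keyword suggestion from averaged description embeddings.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MultiLabelBinarizer

def embed(tokens, emb, dim=300):
    """Average embedding of in-vocabulary tokens; zero vector if none."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_keyword_suggester(descriptions, keyword_sets, emb):
    X = np.stack([embed(d.lower().split(), emb) for d in descriptions])
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(keyword_sets)  # one indicator column per science keyword
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300).fit(X, Y)
    return clf, mlb

def suggest(description, clf, mlb, emb, top_k=3):
    """Return the top-k keywords ranked by the network's predicted probability."""
    x = embed(description.lower().split(), emb).reshape(1, -1)
    probs = clf.predict_proba(x)[0]
    ranked = np.argsort(probs)[::-1][:top_k]
    return [mlb.classes_[i] for i in ranked]
```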


Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 1941
Author(s):  
Gordana Ispirova ◽  
Tome Eftimov ◽  
Barbara Koroušić Seljak

Being both a poison and a cure for many lifestyle and non-communicable diseases, food is moving into the prime focus of precision medicine. Monitoring a few groups of nutrients is crucial for some patients, and methods for easing their calculation are emerging. Our proposed machine learning pipeline deals with nutrient prediction based on vector representations learned from short texts (recipe names). In this study, we explored how the prediction results change when, instead of using the vector representation of the recipe description, we use the embeddings of the list of ingredients. The nutrient content of a food depends on its ingredients; therefore, the text of the ingredient list contains more relevant information. We define a domain-specific heuristic for merging the embeddings of the ingredients, which incorporates the quantity of each ingredient, in order to use them as features in machine learning models for nutrient prediction. The results from the experiments indicate that the prediction results improve when using the domain-specific heuristic. The models for protein prediction were highly effective, with accuracies up to 97.98%. Implementing a domain-specific heuristic for combining multi-word embeddings yields better results than conventional merging heuristics, with up to 60% higher accuracy in some cases.
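A minimal sketch of a quantity-weighted merging heuristic of the kind described above; the exact heuristic in the paper may differ, and `emb`, the ingredient names, and the quantities are illustrative assumptions.

```python
# Sketch: merge ingredient embeddings weighted by each ingredient's share of
# the recipe's total quantity, yielding one feature vector per recipe.
import numpy as np

def merge_ingredient_embeddings(ingredients, quantities, emb, dim=300):
    """Quantity-weighted average of ingredient embeddings -> recipe feature."""
    total = sum(quantities)
    vec = np.zeros(dim)
    for ing, qty in zip(ingredients, quantities):
        if ing in emb:
            vec += (qty / total) * emb[ing]
    return vec

# Usage: the resulting vector feeds a regressor predicting, e.g., protein content.
# recipe_vec = merge_ingredient_embeddings(["flour", "milk", "egg"], [500, 250, 60], emb)
```

Weighting by quantity lets a dominant ingredient (e.g., flour in bread) dominate the recipe representation, which is the intuition behind preferring this heuristic over a plain unweighted average.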

