Domain specific word embeddings for natural language processing in radiology

2021 ◽  
Vol 113 ◽  
pp. 103665
Author(s):  
Timothy L. Chen ◽  
Max Emerling ◽  
Gunvant R. Chaudhari ◽  
Yeshwant R. Chillakuru ◽  
Youngho Seo ◽  
...  
2020 ◽  
Author(s):  
Masashi Sugiyama

Word embeddings have recently been applied successfully to many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn a separate vector for each sense of a word. In this project, we therefore explore two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-Parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which learns the vectors of all senses of a word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG outperforms the other models.
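For concreteness, the core move in MSSG-style models is to keep several vectors per word and assign each occurrence to the sense whose context cluster it matches best. The Python sketch below illustrates that assign-and-update loop under stated assumptions (a fixed number of senses, toy initialization, invented names); it is not the authors' implementation.

```python
import numpy as np

class MultiSenseEmbedding:
    """Toy container for per-sense vectors and context-cluster centers."""

    def __init__(self, vocab_size, dim=100, num_senses=3, seed=0):
        rng = np.random.default_rng(seed)
        # One embedding per (word, sense), plus a running cluster center
        # per sense that summarizes the contexts assigned to it.
        self.senses = rng.normal(scale=0.1, size=(vocab_size, num_senses, dim))
        self.centers = np.zeros((vocab_size, num_senses, dim))
        self.counts = np.ones((vocab_size, num_senses))  # avoids divide-by-zero

    def assign_sense(self, word_id, context_vec):
        """Pick the sense whose context cluster best matches this occurrence."""
        centers = self.centers[word_id] / self.counts[word_id][:, None]
        return int(np.argmax(centers @ context_vec))

    def update_center(self, word_id, sense, context_vec):
        """Online update of the chosen sense's context cluster."""
        self.centers[word_id, sense] += context_vec
        self.counts[word_id, sense] += 1

emb = MultiSenseEmbedding(vocab_size=1000, dim=8)
ctx = np.ones(8)                  # averaged context-word vectors
s = emb.assign_sense(42, ctx)     # which sense of word 42 is active here?
emb.update_center(42, s, ctx)     # then refine that sense's cluster
```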


2018 ◽  
Vol 12 (02) ◽  
pp. 237-260
Author(s):  
Weifeng Xu ◽  
Dianxiang Xu ◽  
Abdulrahman Alatawi ◽  
Omar El Ariss ◽  
Yunkai Liu

The unigram is a fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we discovered statistical properties regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. We describe a probabilistic model that relies on these properties to solve a well-known problem in source code analysis: how to expand a given abbreviation to its originally intended word. Our empirical study shows that using the unigrams extracted from the source code repository outperforms using a natural language corpus by 21% when solving this domain-specific problem.
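As a rough illustration of how such a model can exploit unigram frequencies, the hedged Python sketch below ranks candidate expansions of an abbreviation by corpus count, using a simple subsequence-matching heuristic; the tiny corpus, the heuristic, and the function names are illustrative assumptions, not the paper's model.

```python
from collections import Counter

def is_subsequence(abbr: str, word: str) -> bool:
    """True if the abbreviation's letters appear in order inside the word."""
    it = iter(word)
    return all(ch in it for ch in abbr)

def expand(abbr: str, unigram_counts: Counter):
    """Expand an abbreviation to the most frequent matching unigram."""
    candidates = [
        (count, word) for word, count in unigram_counts.items()
        if word.startswith(abbr[0]) and is_subsequence(abbr, word)
    ]
    # Rank candidates by how often developers actually write them out.
    return max(candidates)[1] if candidates else None

# Tiny stand-in for the 1.01 billion unigrams mined from GitHub projects.
corpus = Counter({"message": 9500, "manager": 7200, "messaging": 800})
print(expand("msg", corpus))  # -> "message"
```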


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Ivano Lauriola ◽  
Fabio Aiolli ◽  
Alberto Lavelli ◽  
Fabio Rinaldi

Background: Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize named entities in textual documents. Several systems exist to solve this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step of these applications is the choice of the representation that describes the data. Several representations have been proposed in the literature, some of which rely on deep domain knowledge and consist of features manually defined by domain experts. Such representations usually describe the problem well, but they require substantial human effort and annotated data. On the other hand, general-purpose representations such as word embeddings do not require human domain knowledge, but they may be too generic for a specific task.
Results: This paper investigates methods to learn the best representation directly from data, by combining several knowledge-based representations and word embeddings. Two mechanisms are considered to perform the combination: neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition that integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score.
Conclusions: Our experiments show that a principled combination of general, domain-specific, word-level, and character-level representations improves the performance of entity recognition. We also discuss the contribution of each representation to the final solution.
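As a small illustration of the Multiple Kernel Learning side of such a combination, the sketch below forms a weighted sum of one base kernel per representation; the toy features, weights, and names are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def combined_kernel(reps_a, reps_b, weights):
    """Weighted sum of one linear (dot-product) kernel per representation."""
    return sum(w * (Xa @ Xb.T) for w, Xa, Xb in zip(weights, reps_a, reps_b))

# Two views of the same 5 tokens: word embeddings and character features.
rng = np.random.default_rng(0)
word_view = rng.normal(size=(5, 50))   # 50-d word-level embeddings
char_view = rng.normal(size=(5, 20))   # 20-d character-level features

K = combined_kernel([word_view, char_view],
                    [word_view, char_view],
                    weights=[0.7, 0.3])
print(K.shape)  # (5, 5) Gram matrix, usable by any kernel classifier
```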


Computers ◽  
2019 ◽  
Vol 8 (1) ◽  
pp. 22
Author(s):  
Frederik Bäumer ◽  
Joschka Kersting ◽  
Michaela Geierhos

The vision of On-the-Fly (OTF) Computing is to compose and provide software services ad hoc, based on requirement descriptions in natural language. Since non-technical users write their software requirements themselves, in unrestricted natural language, deficits such as inaccuracy and incompleteness occur. These deficits are usually addressed by natural language processing methods, which face special challenges in OTF Computing because maximum automation is the goal. In this paper, we present current automatic approaches for resolving inaccuracy and incompleteness in natural language requirement descriptions and elaborate on open challenges. In particular, we discuss the necessity of domain-specific resources and show why, despite far-reaching automation, an intelligent and guided integration of end users into the compensation process is required. In this context, we present our idea of a chatbot that integrates users into the compensation process depending on the given circumstances.


2018 ◽  
Author(s):  
Paulo Henrique Calado Aoun ◽  
Andre C. A. Nascimento ◽  
Adenilton J. Da Silva

The use of word embeddings is becoming very common in many Natural Language Processing tasks. These often require computational resources that are not available on most current mobile devices. In this work, we evaluate a combination of numeric truncation and dimensionality reduction strategies in order to obtain smaller vector representations without substantial losses in performance.
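A minimal sketch of these two compression steps, assuming PCA for the dimensionality reduction and float16 casting for the numeric truncation (the strategies actually evaluated may differ):

```python
import numpy as np

def compress_embeddings(E: np.ndarray, out_dim: int = 64) -> np.ndarray:
    """PCA-reduce an embedding table, then truncate the numeric precision."""
    E_centered = E - E.mean(axis=0)
    # Dimensionality reduction: project onto the top principal components.
    _, _, Vt = np.linalg.svd(E_centered, full_matrices=False)
    reduced = E_centered @ Vt[:out_dim].T
    # Numeric truncation: halve the bytes per stored value.
    return reduced.astype(np.float16)

E = np.random.randn(10000, 300).astype(np.float32)  # a GloVe-sized table
small = compress_embeddings(E)
print(f"{E.nbytes >> 20} MiB -> {small.nbytes >> 20} MiB")
```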


2019 ◽  
Author(s):  
William Jin

Word embeddings have recently been applied successfully to many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn a separate vector for each sense of a word. In this project, we therefore explore two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-Parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which learns the vectors of all senses of a word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG outperforms the other models.
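Complementing the fixed-sense sketch shown earlier, the snippet below illustrates the non-parametric, incremental idea behind NP-MSSG and IMSSG: spawn a new sense vector whenever an occurrence's context is too far from every existing sense cluster. The threshold value and names are illustrative assumptions, not the proposed system.

```python
import numpy as np

def choose_or_spawn_sense(centers, context_vec, threshold=0.3):
    """Return a sense index, creating a new sense when nothing fits."""
    if centers:
        sims = [c @ context_vec / (np.linalg.norm(c) * np.linalg.norm(context_vec))
                for c in centers]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return best          # reuse an existing sense cluster
    centers.append(np.asarray(context_vec, dtype=float))  # spawn a new sense
    return len(centers) - 1

senses = []
print(choose_or_spawn_sense(senses, np.array([1.0, 0.0])))   # 0 (spawned)
print(choose_or_spawn_sense(senses, np.array([0.9, 0.1])))   # 0 (reused)
print(choose_or_spawn_sense(senses, np.array([-1.0, 0.0])))  # 1 (spawned)
```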


2020 ◽  
Vol 34 (05) ◽  
pp. 7456-7463 ◽  
Author(s):  
Zied Bouraoui ◽  
Jose Camacho-Collados ◽  
Steven Schockaert

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.
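A minimal sketch of that final scoring step, assuming a Hugging Face-style classifier head: the checkpoint below is an untuned placeholder, whereas the described system would use a model fine-tuned on the mined template sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; the real system would load weights fine-tuned
# to classify instantiated relation templates.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def relation_score(template: str, head: str, tail: str) -> float:
    """Probability that (head, tail) instantiates the template's relation."""
    text = template.format(head, tail)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(relation_score("{} is the capital of {}.", "Paris", "France"))
```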


Author(s):  
Tianyuan Zhou ◽  
João Sedoc ◽  
Jordan Rodu

Many tasks in natural language processing require the alignment of word embeddings. Embedding alignment relies on the geometric properties of the manifold of word vectors. This paper focuses on supervised linear alignment and studies how the shape of the target embedding affects alignment quality. We assess the performance of aligned word vectors on semantic similarity tasks and find that the isotropy of the target embedding is critical to the alignment. Furthermore, aligning with isotropic noise can deliver satisfactory results. We provide a theoretical framework and guarantees that aid in the understanding of the empirical results.
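For reference, the standard closed-form solution for supervised linear alignment is orthogonal Procrustes; the sketch below recovers a hidden rotation and computes a crude isotropy proxy. The toy data and the specific isotropy measure are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||XW - Y||_F (Schoenemann, 1966)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def isotropy_ratio(E: np.ndarray) -> float:
    """Crude isotropy proxy: min/max singular value (1.0 = fully isotropic)."""
    s = np.linalg.svd(E - E.mean(axis=0), compute_uv=False)
    return float(s[-1] / s[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                 # source embedding
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # a hidden random rotation
Y = X @ Q                                       # target = rotated source

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))   # True: alignment recovers Q
print(round(isotropy_ratio(Y), 2))        # fairly high for Gaussian data
```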

