Word embeddings for application in geosciences: development, evaluation and examples of soil-related concepts

SOIL Discussions ◽

10.5194/soil-2018-44 ◽

2019 ◽

Author(s):

José Padarian ◽

Ignacio Fuentes

Keyword(s):

Language Processing ◽

Dimensional Space ◽

Language Model ◽

Test Suite ◽

Word Embeddings ◽

General Domain ◽

Domain Specific ◽

Descriptive Information ◽

Development Evaluation ◽

Numerical Representations

Abstract. A large amount of descriptive information is available in most disciplines of geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descriptive information and encode it as dense vectors. These word embeddings lie in a multi-dimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations, namely: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. Since this is the first attempt to evaluate word embeddings for tasks in the geosciences domain, we created a test suite specific for geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9 %. The resulting embedding and test suite will be made available for other researchers to use and expand.

Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts

SOIL ◽

10.5194/soil-5-177-2019 ◽

2019 ◽

Vol 5 (2) ◽

pp. 177-187 ◽

Cited By ~ 2

Author(s):

José Padarian ◽

Ignacio Fuentes

Keyword(s):

Language Processing ◽

Language Model ◽

Numerical Data ◽

Test Suite ◽

Multidimensional Space ◽

Word Embeddings ◽

General Domain ◽

Domain Specific ◽

Descriptive Information ◽

Development Evaluation

Abstract. A large amount of descriptive information is available in geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descriptive information and encode it as dense vectors. These word embeddings, which encode information about a word and its linguistic relationships with other words, lie in a multidimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. As this is the first attempt to evaluate word embeddings for tasks in the geosciences domain, we created a test suite specific for geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9 %. We also presented an example where we successfully emulated part of a taxonomic analysis of soil profiles that was originally applied to soil numerical data, which would not be possible without the use of embeddings. The resulting embedding and test suite will be made available for other researchers to use and expand upon.
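The abstract's remark that angles in the embedding space carry linguistic meaning, and its analogy evaluation, can be illustrated with a minimal sketch. The vectors below are hypothetical three-dimensional toys invented for this example; they are not GeoVec values, and the function names `cosine` and `analogy` are assumptions rather than part of the authors' test suite.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, invented for illustration only.
# Real embeddings typically have hundreds of dimensions.
toy = {
    "sand": [0.9, 0.1, 0.0],
    "sandy": [0.9, 0.1, 0.8],
    "clay": [0.1, 0.9, 0.0],
    "clayey": [0.1, 0.9, 0.8],
    "river": [0.5, 0.5, -0.6],
}

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the vector offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, then pick the closest remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("sand", "sandy", "clay", toy))
```

In this toy setup the third dimension plays the role of a noun-to-adjective direction, so the offset from "sand" to "sandy" carries over to "clay". An analogy test of this kind checks whether trained embeddings show similar regularities.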

The effect of word embeddings and domain specific long-range contextual information on a Recurrent Neural Network Language Model

2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA) ◽

10.1109/robomech.2019.8704827 ◽

2019 ◽

Author(s):

Linda Khumalo ◽

Georg I. Schlünz ◽

Quentin Williams

Keyword(s):

Neural Network ◽

Long Range ◽

Recurrent Neural Network ◽

Contextual Information ◽

Language Model ◽

Word Embeddings ◽

Domain Specific ◽

Network Language

Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Applied Sciences ◽

10.3390/app9183648 ◽

2019 ◽

Vol 9 (18) ◽

pp. 3648

Author(s):

Casper S. Shikali ◽

Zhou Sijie ◽

Liu Qihe ◽

Refuoe Mokhosi

Keyword(s):

Language Processing ◽

Critical Role ◽

Language Model ◽

Central Africa ◽

Spoken Language ◽

Language Models ◽

Word Embeddings ◽

Word Representation

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low-resource yet widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-the-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.
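The perplexity values quoted above are easier to read with the standard definition in hand: perplexity is the exponential of the average negative log-probability a language model assigns to each token, so lower is better. The sketch below assumes that per-token probabilities are already available; the function name `perplexity` and the sample probabilities are illustrative assumptions, not the study's evaluation code or data.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that spreads probability uniformly over 4 choices at every step
# is exactly as uncertain as a fair 4-way guess.
uniform_guess = [0.25] * 8
print(perplexity(uniform_guess))
```

Read this way, a perplexity near 31 means the model is, on average, about as uncertain as a uniform choice among roughly 31 tokens at each step.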

Learning Syllables Using Conv-LSTM Model for Swahili Word Representation and Part-of-speech Tagging

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3445975 ◽

2021 ◽

Vol 20 (4) ◽

pp. 1-25

Author(s):

Casper Shikali Shivachi ◽

Refuoe Mokhosi ◽

Zhou Shijie ◽

Liu Qihe

Keyword(s):

Language Processing ◽

Short Term Memory ◽

Language Model ◽

Word Embeddings ◽

Short Term ◽

Term Memory ◽

Pos Tagging ◽

Part Of Speech ◽

Word Representation ◽

Long Short Term Memory

The need to capture intra-word information in natural language processing (NLP) tasks has inspired research in learning various word representations at word, character, or morpheme levels, but little attention has been given to syllables from a syllabic alphabet. Motivated by the success of compositional models in morphological languages, we present a convolutional long short-term memory (Conv-LSTM) model for constructing Swahili word representation vectors from syllables. The unified architecture addresses the word agglutination and polysemous nature of Swahili by extracting high-level syllable features using a convolutional neural network (CNN) and then composing quality word embeddings with a long short-term memory (LSTM) network. The word embeddings are then validated using a syllable-aware language model (31.267) and a part-of-speech (POS) tagging task (98.78), both yielding results that are highly competitive with the state-of-the-art models in their respective domains. We further validate the language model using Xhosa and Shona, which are syllabic-based languages. The novelty of the study lies in its capability to construct quality word embeddings from syllables using a hybrid model that does not use the max-over-pool common in CNNs, and in the exploitation of these embeddings in POS tagging. Therefore, the study plays a crucial role in the processing of agglutinative and syllabic-based languages by contributing quality word embeddings from syllable embeddings and a robust Conv-LSTM model that learns syllables not only for language modeling and POS tagging but also for other downstream NLP tasks.

Domain specific word embeddings for natural language processing in radiology

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2020.103665 ◽

2021 ◽

Vol 113 ◽

pp. 103665

Author(s):

Timothy L. Chen ◽

Max Emerling ◽

Gunvant R. Chaudhari ◽

Yeshwant R. Chillakuru ◽

Youngho Seo ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Word Embeddings ◽

Domain Specific

Inducing Relational Knowledge from BERT

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6242 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7456-7463 ◽

Cited By ~ 3

Author(s):

Zied Bouraoui ◽

Jose Camacho-Collados ◽

Steven Schockaert

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Model ◽

Language Models ◽

Word Embeddings ◽

Relational Knowledge ◽

Wide Range ◽

Fine Tune ◽

Standard Word

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.

FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/622 ◽

2020 ◽

Cited By ~ 1

Author(s):

Zhuang Liu ◽

Degen Huang ◽

Kaiyu Huang ◽

Zhuang Li ◽

Jun Zhao

Keyword(s):

Deep Learning ◽

Text Mining ◽

Language Processing ◽

Large Scale ◽

Language Model ◽

Training Data ◽

Domain Specific ◽

Current State ◽

Language Representation ◽

Financial Domain

There is growing interest in the tasks of financial text mining. Over the past few years, Natural Language Processing (NLP) based on deep learning has advanced rapidly, and deep learning has shown promising results on financial text mining models. However, as NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. In FinBERT, unlike in BERT, we construct six pre-training tasks covering more knowledge, trained simultaneously on general corpora and financial domain corpora, which enables the FinBERT model to better capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.

Ablations over transformer models for biomedical relationship extraction

F1000Research ◽

10.12688/f1000research.24552.1 ◽

2020 ◽

Vol 9 ◽

pp. 710 ◽

Cited By ~ 1

Author(s):

Richard G Jackson ◽

Erik Jansson ◽

Aron Lagerberg ◽

Elliot Ford ◽

Vladimir Poroshin ◽

...

Keyword(s):

Language Processing ◽

Protein Interactions ◽

Recent Model ◽

Training Data ◽

General Domain ◽

Domain Specific ◽

Relationship Extraction ◽

Language Modelling ◽

Model Training ◽

Order Of Training

Background: Masked language modelling approaches have enjoyed success in improving benchmark performance across many general and biomedical domain natural language processing tasks, including biomedical relationship extraction (RE). However, the recent surge in both the number of novel architectures and the volume of training data they utilise may lead us to question whether domain specific pretrained models are necessary. Additionally, recent work has proposed novel classification heads for RE tasks, further improving performance. Here, we perform ablations over several pretrained models and classification heads to try to untangle the perceived benefits of each. Methods: We use a range of string preprocessing strategies, combined with Bidirectional Encoder Representations from Transformers (BERT), BioBERT and RoBERTa architectures to perform ablations over three RE datasets pertaining to drug-drug and chemical protein interactions, and general domain relationship extraction. We explore the use of the RBERT classification head, compared to a simple linear classification layer across all architectures and datasets. Results: We observe a moderate performance benefit in using the BioBERT pretrained model over the BERT base cased model, although there appears to be little difference when comparing BioBERT to RoBERTa large. In addition, we observe a substantial benefit of using the RBERT head on the general domain RE dataset, but this is not consistently reflected in the biomedical RE datasets. Finally, we discover that randomising the token order of training data does not result in catastrophic performance degradation in our selected tasks. Conclusions: We find a recent general domain pretrained model performs approximately the same as a biomedical specific one, suggesting that domain specific models may be of limited use given the tendency of recent model pretraining regimes to incorporate ever broader sets of data. In addition, we suggest that care must be taken in RE model training, to prevent fitting to non-syntactic features of datasets.

Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/508 ◽

2020 ◽

Author(s):

Juntao Li ◽

Ruidan He ◽

Hai Ye ◽

Hwee Tou Ng ◽

Lidong Bing ◽

...

Keyword(s):

Language Processing ◽

Large Scale ◽

Language Model ◽

Language Models ◽

Low Resource ◽

Performance Improvements ◽

Domain Specific ◽

High Resource ◽

Significant Performance ◽

Cross Lingual

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements over various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of texts, cross-lingual language models have proven to be effective in leveraging high-resource languages to enhance low-resource language processing and outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting when a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

ACM Transactions on Computing for Healthcare ◽

10.1145/3458754 ◽

2022 ◽

Vol 3 (1) ◽

pp. 1-23

Author(s):

Yu Gu ◽

Robert Tinn ◽

Hao Cheng ◽

Michael Lucas ◽

Naoto Usuyama ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

General Domain ◽

Domain Specific ◽

And Task

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
