Building Hybrid Representations from Text Corpora, Knowledge Graphs, and Language Models

Author(s):  
Jose Manuel Gomez-Perez ◽  
Ronald Denaux ◽  
Andres Garcia-Silva


2015 ◽
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Duc-Thuan Vo ◽  
Vo Thuan Hai ◽  
Cheol-Young Ock

Classifying events in Twitter is challenging because tweet texts carry a large amount of temporal data, considerable noise, and a wide variety of topics. In this paper, we propose a method to classify events from Twitter. We first identify the distinguishing terms between tweets in events and measure their similarities with language models such as ConceptNet and a latent Dirichlet allocation method for selectional preferences (LDA-SP), both of which have been widely studied over large text corpora for modeling linguistic relations. The relationships among the terms of a tweet are discovered by checking them under each model. We then propose a method to compute the similarity between tweets based on their features, including common terms and the relationships among their distinguishing terms. This similarity measure is explicit and convenient to apply in k-nearest neighbor classification. Experiments on the Edinburgh Twitter Corpus show that our method achieves competitive results for classifying events.
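A minimal sketch of the classification step this abstract describes, assuming a precomputed set of related term pairs in place of the ConceptNet/LDA-SP relation checks (which we abstract away here); `tweet_similarity`, `knn_classify_event`, and `alpha` are illustrative names, not the paper's:

```python
from collections import Counter

def tweet_similarity(terms_a, terms_b, related_pairs, alpha=0.5):
    """Combine overlap of common terms with relationships among
    distinguishing terms, as the abstract outlines."""
    a, b = set(terms_a), set(terms_b)
    common = len(a & b) / len(a | b) if a | b else 0.0  # Jaccard on shared terms
    # Count related pairs among the distinguishing (non-shared) terms.
    related = sum(1 for x in a - b for y in b - a
                  if (x, y) in related_pairs or (y, x) in related_pairs)
    max_pairs = len(a - b) * len(b - a)
    relation = related / max_pairs if max_pairs else 0.0
    return alpha * common + (1 - alpha) * relation

def knn_classify_event(query_terms, labeled_tweets, related_pairs, k=5):
    """labeled_tweets: list of (term_list, event_label) pairs."""
    neighbors = sorted(labeled_tweets,
                       key=lambda t: tweet_similarity(query_terms, t[0], related_pairs),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```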


2008 ◽  
Vol 04 (01) ◽  
pp. 87-106
Author(s):  
ALKET MEMUSHAJ ◽  
TAREK M. SOBH

Probabilistic language models have gained popularity in natural language processing due to their ability to capture language structures and constraints with computational efficiency. They are flexible and easily adapted to language change over time, as well as to new languages. Probabilistic language models can be trained readily, but their accuracy is strongly tied to the availability of large text corpora. In this paper, we investigate the usability of grapheme probabilistic models, specifically grapheme n-gram models, in spellchecking and augmentative typing systems. Grapheme n-gram models require substantially smaller training corpora, which is one of the main motivations for this work, in which we build grapheme n-gram language models for the Albanian language; at present, no Albanian language corpora are available for probabilistic language modeling. Our technique augments spellchecking and augmentative typing systems by using grapheme n-gram language models to improve suggestion accuracy. It can be implemented in a standalone tool or incorporated into another tool to offer additional selection/scoring criteria.
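A minimal sketch of a grapheme n-gram model of the kind described, with add-one smoothing; the `vocab_size=40` default (roughly the 36-letter Albanian alphabet plus boundary symbols) and all function names are our own assumptions:

```python
import math
from collections import Counter

def train_grapheme_ngrams(words, n=3):
    """Count grapheme (character) n-grams, padding word boundaries."""
    counts, contexts = Counter(), Counter()
    for w in words:
        padded = "^" * (n - 1) + w + "$"
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            counts[gram] += 1
            contexts[gram[:-1]] += 1
    return counts, contexts

def word_logprob(word, counts, contexts, n=3, vocab_size=40):
    """Add-one smoothed log-probability of a word under the grapheme model."""
    padded = "^" * (n - 1) + word + "$"
    return sum(math.log((counts[padded[i:i + n]] + 1) /
                        (contexts[padded[i:i + n - 1]] + vocab_size))
               for i in range(len(padded) - n + 1))

# Rank spelling suggestions by model score (higher = more plausible):
# suggestions.sort(key=lambda c: word_logprob(c, counts, contexts), reverse=True)
```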


2021 ◽  
Vol 21 (S9) ◽  
Author(s):  
Yinyu Lan ◽  
Shizhu He ◽  
Kang Liu ◽  
Xiangrong Zeng ◽  
Shengping Liu ◽  
...  

Abstract

Background: Knowledge graphs (KGs), especially medical knowledge graphs, are often significantly incomplete, creating a demand for medical knowledge graph completion (MedKGC). MedKGC finds new facts based on the existing knowledge in a KG. Path-based knowledge reasoning is one of the most important approaches to this task and has received great attention in recent years because of its high performance and interpretability. Traditional methods such as the path ranking algorithm (PRA) treat the paths between an entity pair as atomic features. However, medical KGs are very sparse, which makes it difficult to model effective semantic representations for extremely sparse path features. The sparsity in medical KGs is mainly reflected in the long-tailed distribution of entities and paths. Previous methods consider only the context structure of paths in the knowledge graph and ignore the textual semantics of the symbols in the path; as a result, their performance cannot be further improved because of entity sparseness and path sparseness.

Methods: To address these issues, this paper proposes two novel path-based reasoning methods that solve the sparsity issues of entities and paths, respectively, by adopting the textual semantic information of entities and paths for MedKGC. Using the pre-trained model BERT and combining the textual semantic representations of entities and relationships, we model symbolic reasoning in the medical KG as numerical computation over textual semantic representations.

Results: Experimental results on a publicly available, authoritative Chinese symptom knowledge graph demonstrate that the proposed method is significantly better than state-of-the-art path-based knowledge graph reasoning methods, improving average performance by 5.83% across all relations.

Conclusions: We propose two new knowledge graph reasoning algorithms that adopt the textual semantic information of entities and paths and can effectively alleviate their sparsity in MedKGC. To the best of our knowledge, this is the first method to use pre-trained language models and textual path representations for medical knowledge reasoning. Our method can complete the impaired symptom knowledge graph in an interpretable way, and it outperforms state-of-the-art path-based reasoning methods.
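A minimal sketch of the core idea, not the authors' exact architecture: verbalize a KG path as text, encode it with a pre-trained BERT via the Hugging Face `transformers` library, and score it against a candidate relation by cosine similarity of the [CLS] vectors. The model name and verbalization scheme are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed model
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode(text):
    """Return the [CLS] vector as a textual semantic representation."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state[:, 0]

def path_score(path_symbols, candidate_relation):
    """Score a verbalized path against a candidate relation. Because the
    path is treated as text rather than as an atomic symbol, rare (sparse)
    paths with similar wording receive similar representations."""
    path_text = " ".join(path_symbols)  # e.g. ["fever", "symptom of", "influenza"]
    return torch.cosine_similarity(encode(path_text),
                                   encode(candidate_relation)).item()
```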


2018 ◽  
Author(s):  
Simon De Deyne ◽  
Danielle Navarro ◽  
Guillem Collell ◽  
Amy Perfors

One of the main limitations of natural language-based approaches to meaning is that they are not grounded. In this study, we evaluate how well different kinds of models account for people's representations of both concrete and abstract concepts. The models are unimodal (language-based only) models and multimodal distributional semantic models (which additionally incorporate perceptual and/or affective information). The language-based models include both external language (based on text corpora) and internal language (derived from word associations). We present two new studies and a re-analysis of a series of previous studies demonstrating that unimodal performance is substantially higher for internal models, especially when comparisons at the basic level are considered. For multimodal models, our findings suggest that additional visual and affective features lead to only slightly more accurate mental representations of word meaning than what is already encoded in internal language models; for abstract concepts, however, visual and affective features improve the predictions of external text-based models. Our work presents new evidence that the grounding problem extends to abstract words and is therefore more widespread than previously suggested. Implications for both embodied and distributional views are discussed.
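A small sketch of the standard evaluation behind such comparisons: correlating model-derived similarities with human relatedness judgments via Spearman's rho. The `vectors` argument is a placeholder for any of the models compared above (external text-based, internal word-association, or multimodal):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(vectors, human_judgments):
    """vectors: dict word -> np.ndarray; human_judgments: (w1, w2, rating) triples.
    Returns Spearman's rho between model similarities and human ratings."""
    model_sims, ratings = [], []
    for w1, w2, rating in human_judgments:
        if w1 in vectors and w2 in vectors:
            model_sims.append(cosine(vectors[w1], vectors[w2]))
            ratings.append(rating)
    rho, _ = spearmanr(model_sims, ratings)
    return rho
```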


Author(s):  
Nona Naderi ◽  
Julien Knafou ◽  
Jenny Copara ◽  
Patrick Ruch ◽  
Douglas Teodoro

The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods have been proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual and ensembles of deep masked language models perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvements of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
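A minimal sketch of the classical majority-voting step described above, applied to per-token BIO labels from several fine-tuned models; tie-breaking by model order is our assumption, and the label sequences are invented for illustration:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label sequence per model, aligned token by token."""
    ensembled = []
    for token_labels in zip(*predictions):
        votes = Counter(token_labels)
        ensembled.append(votes.most_common(1)[0][0])  # ties: first model wins
    return ensembled

models_out = [
    ["B-CHEM", "I-CHEM", "O"],  # model 1
    ["B-CHEM", "O",      "O"],  # model 2
    ["B-CHEM", "I-CHEM", "O"],  # model 3
]
print(majority_vote(models_out))  # ['B-CHEM', 'I-CHEM', 'O']
```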


2021 ◽  
Author(s):  
Michihiro Yasunaga ◽  
Hongyu Ren ◽  
Antoine Bosselut ◽  
Percy Liang ◽  
Jure Leskovec


Author(s):  
Atro Voutilainen

This article outlines recently used methods for designing part-of-speech taggers: computer programs for assigning contextually appropriate grammatical descriptors to words in texts. It begins with a description of the general architecture and task setting, gives an overview of the history of tagging, and describes the central approaches: taggers based on handwritten local rules, taggers based on n-grams automatically derived from text corpora, taggers based on hidden Markov models, taggers using automatically generated symbolic language models derived with machine learning methods, taggers based on handwritten global rules, and hybrid taggers, which combine the advantages of handwritten and automatically generated taggers. The article focuses on handwritten tagging rules, an example of which is sketched below. Well-tagged training corpora are a valuable resource for testing and improving language models, and the text corpus reminds the grammarian of any oversights while designing a rule.
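As a concrete illustration, here is a toy handwritten local rule in the spirit of constraint-grammar taggers: discard a verb reading when the word is directly preceded by an unambiguous determiner. The data format and rule are our own sketch, not from the article:

```python
def apply_rule(sentence):
    """sentence: list of (word, candidate_tag_set) pairs; the rule removes
    a VERB reading after an unambiguous determiner."""
    for i, (word, tags) in enumerate(sentence):
        prev_tags = sentence[i - 1][1] if i > 0 else set()
        if prev_tags == {"DET"} and "VERB" in tags and len(tags) > 1:
            tags.discard("VERB")  # e.g. "the hand" cannot be the verb "hand"
    return sentence

sent = [("the", {"DET"}), ("hand", {"NOUN", "VERB"})]
print(apply_rule(sent))  # [('the', {'DET'}), ('hand', {'NOUN'})]
```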


2021 ◽  
Author(s):  
Alessandro Oltramari ◽  
Jonathan Francis ◽  
Filip Ilievski ◽  
Kaixin Ma ◽  
Roshanak Mirzaee

This chapter illustrates how suitable neuro-symbolic models for language understanding can enable domain generalizability and robustness in downstream tasks. Different methods for integrating neural language models and knowledge graphs are discussed, and the situations in which this combination is most appropriate are characterized through quantitative evaluation and qualitative error analysis on a variety of commonsense question answering benchmark datasets.
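A schematic sketch of one integration pattern such work discusses: fusing a neural language model's answer score with symbolic evidence retrieved from a knowledge graph. Both inputs are placeholders for real components (e.g., a fine-tuned LM scoring head and ConceptNet path retrieval), and the weighting is an arbitrary assumption:

```python
def fuse_scores(lm_score, kg_paths, weight=0.7):
    """Combine an LM plausibility score with normalized KG path evidence."""
    kg_score = min(len(kg_paths), 5) / 5.0  # cap and normalize path counts
    return weight * lm_score + (1 - weight) * kg_score

# Hypothetical candidates for "Where do people keep milk?":
candidates = {"fridge": (0.62, ["milk -AtLocation-> fridge"]),
              "garage": (0.55, [])}
best = max(candidates, key=lambda a: fuse_scores(*candidates[a]))
print(best)  # 'fridge'
```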

