cs60075_team2 at SemEval-2021 Task 1 : Lexical Complexity Prediction using Transformer-based Language Models pre-trained on various text corpora

Classifying events is challenging in Twitter because tweets texts have a large amount of temporal data with a lot of noise and various kinds of topics. In this paper, we propose a method to classify events from Twitter. We firstly find the distinguishing terms between tweets in events and measure their similarities with learning language models such as ConceptNet and a latent Dirichlet allocation method for selectional preferences (LDA-SP), which have been widely studied based on large text corpora within computational linguistic relations. The relationship of term words in tweets will be discovered by checking them under each model. We then proposed a method to compute the similarity between tweets based on tweets’ features including common term words and relationships among their distinguishing term words. It will be explicit and convenient for applying to k-nearest neighbor techniques for classification. We carefully applied experiments on the Edinburgh Twitter Corpus to show that our method achieves competitive results for classifying events.

Download Full-text

USING GRAPHEME n-GRAMS IN SPELLING CORRECTION AND AUGMENTATIVE TYPING SYSTEMS

New Mathematics and Natural Computation ◽

10.1142/s1793005708000970 ◽

2008 ◽

Vol 04 (01) ◽

pp. 87-106

Author(s):

ALKET MEMUSHAJ ◽

TAREK M. SOBH

Keyword(s):

Natural Language Processing ◽

Language Processing ◽

Computational Efficiency ◽

Probabilistic Models ◽

Language Modeling ◽

Language Models ◽

Text Corpora ◽

N Gram ◽

Changes Over Time ◽

Over Time

Probabilistic language models have gained popularity in Natural Language Processing due to their ability to successfully capture language structures and constraints with computational efficiency. Probabilistic language models are flexible and easily adapted to language changes over time as well as to some new languages. Probabilistic language models can be trained and their accuracy strongly related to the availability of large text corpora. In this paper, we investigate the usability of grapheme probabilistic models, specifically grapheme n-grams models in spellchecking as well as augmentative typing systems. Grapheme n-gram models require substantially smaller training corpora and that is one of the main drivers for this thesis in which we build grapheme n-gram language models for the Albanian language. There are presently no available Albanian language corpora to be used for probabilistic language modeling. Our technique attempts to augment spellchecking and typing systems by utilizing grapheme n-gram language models in improving suggestion accuracy in spellchecking and augmentative typing systems. Our technique can be implemented in a standalone tool or incorporated in another tool to offer additional selection/scoring criteria.

Download Full-text

JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models

10.18653/v1/2021.semeval-1.85 ◽

2021 ◽

Author(s):

Tuqa Bani Yaseen ◽

Qusai Ismail ◽

Sarah Al-Omari ◽

Eslam Al-Sobh ◽

Malak Abdullah

Keyword(s):

Language Models ◽

Lexical Complexity

Download Full-text

Visual and Affective Grounding in Language and Mind

10.31234/osf.io/q97f8 ◽

2018 ◽

Cited By ~ 2

Author(s):

Simon De Deyne ◽

Danielle Navarro ◽

Guillem Collell ◽

Amy Perfors

Keyword(s):

Mental Representations ◽

Word Meaning ◽

Language Models ◽

Abstract Concepts ◽

Text Corpora ◽

Semantic Models ◽

Abstract Words ◽

Distributional Semantic Models ◽

Affective Information ◽

New Evidence

One of the main limitations in natural language-based approaches to meaning is that they are not grounded. In this study, we evaluate how well different kinds of models account for people’s representations of both concrete and abstract concepts. The models are both unimodal (language-based only) models and multimodal distributional semantic models (which additionallyincorporate perceptual and/or affective information). The language-based models include both external (based on text corpora) and internal (derived from word associations) language. We present two new studies and a re-analysis of a series of previous studies demonstrating that the unimodal performance is substantially higher for internal models, especially when comparisons at the basiclevel are considered. For multimodal models, our findings suggest that additional visual and affective features lead to only slightly more accurate mental representations of word meaning than what is already encoded in internal language models; however, for abstract concepts, visual andaffective features improve the predictions of external text-based models. Our work presents new evidence that the grounding problem includes abstract words as well and is therefore more widespread than previously suggested. Implications for both embodied and distributional views arediscussed.

Download Full-text

Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora

Frontiers in Research Metrics and Analytics ◽

10.3389/frma.2021.689803 ◽

2021 ◽

Vol 6 ◽

Author(s):

Nona Naderi ◽

Julien Knafou ◽

Jenny Copara ◽

Patrick Ruch ◽

Douglas Teodoro

Keyword(s):

Life Science ◽

Named Entity Recognition ◽

Majority Voting ◽

Entity Recognition ◽

Language Models ◽

Free Text ◽

Specific Domain ◽

Named Entities ◽

Named Entity ◽

Text Corpora

The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual and ensemble of deep masked language models perform across corpora of different health and life science domains—biology, chemistry, and medicine—available in different languages—English and French. Individual deep masked language models, pretrained on external corpora, are fined-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that the ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.

Download Full-text

Part-of-Speech Tagging

10.1093/oxfordhb/9780199276349.013.0011 ◽

2012 ◽

Author(s):

Atro Voutilainen

Keyword(s):

Markov Models ◽

Language Model ◽

Language Models ◽

Symbolic Language ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Text Corpora ◽

History Of ◽

General Architecture ◽

Speech Tagging

This article outlines the recently used methods for designing part-of-speech taggers; computer programs for assigning contextually appropriate grammatical descriptors to words in texts. It begins with the description of general architecture and task setting. It gives an overview of the history of tagging and describes the central approaches to tagging. These approaches are: taggers based on handwritten local rules, taggers based on n-grams automatically derived from text corpora, taggers based on hidden Markov models, taggers using automatically generated symbolic language models derived using methods from machine tagging, taggers based on handwritten global rules, and hybrid taggers, which combine the advantages of handwritten and automatically generated taggers. This article focuses on handwritten tagging rules. Well-tagged training corpora are a valuable resource for testing and improving language model. The text corpus reminds the grammarian about any oversight while designing a rule.

Download Full-text

Predicting Human Similarity Judgments with Distributional Models: The Value of Word Associations

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/671 ◽

2017 ◽

Cited By ~ 3

Author(s):

Simon De Deyne ◽

Amy Perfors ◽

Daniel J. Navarro

Keyword(s):

Word Meaning ◽

Language Models ◽

Language Resources ◽

Data Set ◽

Word Associations ◽

Text Corpora ◽

Current State ◽

Distributional Properties ◽

Alternative Approaches ◽

Similarity Judgments

To represent the meaning of a word, most models use external language resources, such as text corpora, to derive the distributional properties of word usage. In this study, we propose that internal language models, that are more closely aligned to the mental representations of words, can be used to derive new theoretical questions regarding the structure of the mental lexicon. A comparison with internal models also puts into perspective a number of assumptions underlying recently proposed distributional text-based models could provide important insights into cognitive science, including linguistics and artificial intelligence. We focus on word-embedding models which have been proposed to learn aspects of word meaning in a manner similar to humans and contrast them with internal language models derived from a new extensive data set of word associations. An evaluation using relatedness judgments shows that internal language models consistently outperform current state-of-the art text-based external language models. This suggests alternative approaches to represent word meaning using properties that aren't encoded in text.

Download Full-text

UNBNLP at SemEval-2021 Task 1: Predicting lexical complexity with masked language models and character-level encoders

10.18653/v1/2021.semeval-1.83 ◽

2021 ◽

Author(s):

Milton King ◽

Ali Hakimi Parizi ◽

Samin Fakharian ◽

Paul Cook

Keyword(s):

Language Models ◽

Lexical Complexity

Download Full-text

An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian

Sensors ◽

10.3390/s21010133 ◽

2020 ◽

Vol 21 (1) ◽

pp. 133

Author(s):

Marco Pota ◽

Mirko Ventura ◽

Rosario Catelli ◽

Massimo Esposito

Keyword(s):

Sentiment Analysis ◽

Language Model ◽

Language Models ◽

Plain Text ◽

Analysis Techniques ◽

Academic Communities ◽

Text Corpora ◽

General Basis ◽

Model Training

Over the last decade industrial and academic communities have increased their focus on sentiment analysis techniques, especially applied to tweets. State-of-the-art results have been recently achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle the Twitter jargon. This work aims to introduce a different approach for Twitter sentiment analysis based on two steps. Firstly, the tweet jargon, including emojis and emoticons, is transformed into plain text, exploiting procedures that are language-independent or easily applicable to different languages. Secondly, the resulting tweets are classified using the language model BERT, but pre-trained on plain text, instead of tweets, for two reasons: (1) pre-trained models on plain text are easily available in many languages, avoiding resource- and time-consuming model training directly on tweets from scratch; (2) available plain text corpora are larger than tweet-only ones, therefore allowing better performance. A case study describing the application of the approach to Italian is presented, with a comparison with other Italian existing solutions. The results obtained show the effectiveness of the approach and indicate that, thanks to its general basis from a methodological perspective, it can also be promising for other languages.

Download Full-text

cs60075_team2 at SemEval-2021 Task 1 : Lexical Complexity Prediction using Transformer-based Language Models pre-trained on various text corpora

Building Hybrid Representations from Text Corpora, Knowledge Graphs, and Language Models

Exploiting Language Models to Classify Events from Twitter

USING GRAPHEME n-GRAMS IN SPELLING CORRECTION AND AUGMENTATIVE TYPING SYSTEMS

JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models

Visual and Affective Grounding in Language and Mind

Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora

Part-of-Speech Tagging

Predicting Human Similarity Judgments with Distributional Models: The Value of Word Associations

UNBNLP at SemEval-2021 Task 1: Predicting lexical complexity with masked language models and character-level encoders

An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian

Export Citation Format