BERT meets LIWC: Exploring State-of-the-Art Language Models for Predicting Communication Behavior in Couples’ Conflict Interactions

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art researches on NER are based on pre-trained language models (PLMs) or classic neural models. However, these researches are mainly oriented to high-resource languages such as English. While for Indonesian, related resources (both in dataset and technology) are not yet well-developed. Besides, affix is an important word composition for Indonesian language, indicating the essentiality of character and token features for token-wise Indonesian NLP tasks. However, features extracted by currently top-performance models are insufficient. Aiming at Indonesian NER task, in this paper, we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA establishes competitive performance on IDNER and three benchmark datasets.

Download Full-text

Attending to Entities for Better Text Understanding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6254 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7554-7561

Author(s):

Pengxiang Cheng ◽

Katrin Erk

Keyword(s):

Large Scale ◽

Human Performance ◽

State Of The Art ◽

Syntactic Structure ◽

Semantic Knowledge ◽

Training Data ◽

Language Models ◽

Long Distance ◽

Future Directions ◽

Text Understanding

Recent progress in NLP witnessed the development of large-scale pre-trained language models (GPT, BERT, XLNet, etc.) based on Transformer (Vaswani et al. 2017), and in a range of end tasks, such models have achieved state-of-the-art results, approaching human performance. This clearly demonstrates the power of the stacked self-attention architecture when paired with a sufficient number of layers and a large amount of pre-training data. However, on tasks that require complex and long-distance reasoning where surface-level cues are not enough, there is still a large gap between the pre-trained models and human performance. Strubell et al. (2018) recently showed that it is possible to inject knowledge of syntactic structure into a model through supervised self-attention. We conjecture that a similar injection of semantic knowledge, in particular, coreference information, into an existing model would improve performance on such complex problems. On the LAMBADA (Paperno et al. 2016) task, we show that a model trained from scratch with coreference as auxiliary supervision for self-attention outperforms the largest GPT-2 model, setting the new state-of-the-art, while only containing a tiny fraction of parameters compared to GPT-2. We also conduct a thorough analysis of different variants of model architectures and supervision configurations, suggesting future directions on applying similar techniques to other problems.

Download Full-text

MSA Transformer

10.1101/2021.02.12.430858 ◽

2021 ◽

Author(s):

Roshan Rao ◽

Jason Liu ◽

Robert Verkuil ◽

Joshua Meier ◽

John F. Canny ◽

...

Keyword(s):

Structure Learning ◽

State Of The Art ◽

Language Model ◽

Language Modeling ◽

Language Models ◽

Multiple Sequence ◽

Wide Margin ◽

Current State ◽

Individual Sequences ◽

And Function

AbstractUnsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.

Download Full-text

Ascent of Pre-trained State-of-the-Art Language Models

Algorithms for Intelligent Systems - Advanced Computing Technologies and Applications ◽

10.1007/978-981-15-3242-9_26 ◽

2020 ◽

pp. 269-280

Author(s):

Keval Nagda ◽

Anirudh Mukherjee ◽

Milind Shah ◽

Pratik Mulchandani ◽

Lakshmi Kurup

Keyword(s):

State Of The Art ◽

Language Models

Download Full-text

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00112 ◽

2016 ◽

Vol 4 ◽

pp. 477-490 ◽

Cited By ~ 4

Author(s):

Ehsan Shareghi ◽

Matthias Petri ◽

Gholamreza Haffari ◽

Trevor Cohn

Keyword(s):

Infinite Order ◽

State Of The Art ◽

Language Model ◽

Language Models ◽

Memory Usage ◽

Suffix Trees ◽

Construction Time ◽

Modest Increase ◽

Order Language ◽

Language Modelling

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).

Download Full-text

QASC: A Dataset for Question Answering via Sentence Composition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6319 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8082-8090

Author(s):

Tushar Khot ◽

Peter Clark ◽

Michal Guerquin ◽

Peter Jansen ◽

Ashish Sabharwal

Keyword(s):

Common Sense ◽

Human Performance ◽

Question Answering ◽

State Of The Art ◽

Multiple Choice ◽

Training Data ◽

Language Models ◽

Current State ◽

New Concepts ◽

Large Corpus

Composing knowledge from multiple pieces of texts is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.

Download Full-text

Improving word prediction for augmentative communication by using idiolects and sociolects

Dutch Journal of Applied Linguistics ◽

10.1075/dujal.3.2.03sto ◽

2014 ◽

Vol 3 (2) ◽

pp. 137-154 ◽

Cited By ~ 1

Author(s):

Wessel Stoop ◽

Antal van den Bosch

Keyword(s):

Data Collection ◽

State Of The Art ◽

Training Data ◽

Language Models ◽

Prediction System ◽

Word Prediction ◽

The People ◽

The Social ◽

Statistical Language Models ◽

Assistive Communication

Word prediction, or predictive editing, has a long history as a tool for augmentative and assistive communication. Improvements in the state-of-the-art can still be achieved, for instance by training personalized statistical language models. We developed the word prediction system Soothsayer. The main innovation of Soothsayer is that it not only uses idiolects, the language of one individual person, as training data, but also sociolects, the language of the social circle around that person. We use Twitter for data collection and experimentation. The idiolect models are based on individual Twitter feeds, the sociolect models are based on the tweets of a particular person and the tweets of the people he often communicates with. The sociolect approach achieved the best results. For a number of users, more than 50% of the keystrokes could have been saved if they had used Soothsayer.

Download Full-text

Text: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning

10.31234/osf.io/293kt ◽

2021 ◽

Author(s):

Oscar Nils Erik Kjell ◽

H. Andrew Schwartz ◽

Salvatore Giorgi

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Rating Scale ◽

State Of The Art ◽

R Package ◽

Language Models ◽

Categorical Variables ◽

Human Language

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. Text is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as providing a framework for the advanced users to build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed: to transform text to traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain: to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest: to computing semantic similarity scores between texts and significance test the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot: to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating scale scores).

Download Full-text

Concept Recognition as a Machine Translation Problem

10.1101/2020.12.03.410829 ◽

2020 ◽

Author(s):

Mayla R Boguslav ◽

Negacy D Hailu ◽

Michael Bada ◽

William A Baumgartner ◽

Lawrence E Hunter

Keyword(s):

Machine Learning ◽

Machine Translation ◽

Language Processing ◽

State Of The Art ◽

Training Data ◽

Language Models ◽

Alternative Methods ◽

Automated Assignment ◽

Concept Recognition ◽

Alternative Approaches

AbstractBackgroundAutomated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data has impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models had the potential to outperform multi-class classification approaches. Here we systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning.ResultsWe report on our extensive studies of alternative methods and hyperparameter selections. The results not only identify the best-performing systems and parameters across a wide variety of ontologies but also illuminate about the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggest promising avenues for future improvements as well as design choices that can increase computational efficiency with small costs in performance. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) for span detection (as previously found) along with the Open-source Toolkit for Neural Machine Translation (OpenNMT) for concept normalization achieve state-of-the-art performance for most ontologies in CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time than several alternative approaches.ConclusionsMachine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT Shared Task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation.

Download Full-text

LMPred: Predicting Antimicrobial Peptides Using Pre-Trained Language Models and Deep Learning

10.1101/2021.11.03.467066 ◽

2021 ◽

Author(s):

William Dee

Keyword(s):

Antimicrobial Peptides ◽

Bacterial Resistance ◽

State Of The Art ◽

Language Models ◽

Peptide Sequence ◽

Independent Dataset ◽

Therapeutic Drugs ◽

Novel Approach ◽

Previous State ◽

In Silico Models

Antimicrobial peptides (AMPs) are increasingly being used in the development of new therapeutic drugs, in areas such as cancer therapy and hypertension. Additionally, they are seen as an alternative to antibiotics due to the increasing occurrence of bacterial resistance. Wet-laboratory experimental identification, however, is both time consuming and costly, so in-silico models are now commonly used in order to screen new AMP candidates. This paper proposes a novel approach of creating model inputs; using pre-trained language models to produce contextualized embeddings representing the amino acids within each peptide sequence, before a convolutional neural network is then trained as the classifier. The optimal model was validated on two datasets, being one previously used in AMP prediction research, and an independent dataset, created by this paper. Predictive accuracies of 93.33% and 88.26% were achieved respectively, outperforming all previous state-of-the-art classification models.

Download Full-text