MSA Transformer

2021 ◽  
Author(s):  
Roshan Rao ◽  
Jason Liu ◽  
Robert Verkuil ◽  
Joshua Meier ◽  
John F. Canny ◽  
...  

Abstract Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
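As a rough illustration of the interleaved row and column attention described above, the following sketch applies single-head self-attention first along each aligned sequence and then across the sequences in each alignment column. The shapes, the single attention head, and the absence of learned query/key/value projections are simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def self_attention(x):
    """Single-head self-attention over the second-to-last axis of x: (..., N, D)."""
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d ** 0.5     # (..., N, N)
    return torch.softmax(scores, dim=-1) @ x        # (..., N, D)


class RowColumnBlock(nn.Module):
    """One block that attends along each aligned sequence (row attention)
    and then across sequences within each column (column attention)."""

    def __init__(self, dim):
        super().__init__()
        self.norm_row = nn.LayerNorm(dim)
        self.norm_col = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, msa):                                       # msa: (rows, cols, dim)
        msa = msa + self_attention(self.norm_row(msa))            # row attention
        col = self_attention(self.norm_col(msa).transpose(0, 1))  # column attention
        msa = msa + col.transpose(0, 1)
        return msa + self.ffn(msa)


msa = torch.randn(8, 64, 32)             # 8 aligned sequences of length 64, embedding dim 32
print(RowColumnBlock(32)(msa).shape)     # torch.Size([8, 64, 32])
```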

Author(s):  
Kelvin Guu ◽  
Tatsunori B. Hashimoto ◽  
Yonatan Oren ◽  
Percy Liang

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.
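A toy rendering of the prototype-then-edit story may help: sample a prototype sentence from the training corpus, sample a latent edit vector, and let an editor turn the prototype into a new sentence. The corpus, the word-swap editor, and the Gaussian edit vector below are placeholders; the paper's editor is a neural model conditioned on the latent vector.

```python
import random

corpus = [
    "the food was great and the service was fast",
    "the hotel room was clean but very small",
    "i loved the movie although the ending was weak",
]
synonyms = {"great": "excellent", "fast": "quick", "small": "tiny", "weak": "flat"}


def sample_edit_vector(dim=4):
    """Latent edit vector z ~ p(z); here just Gaussian noise."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def edit(prototype, z):
    """Placeholder editor: the sign of the edit vector decides whether a word is swapped."""
    words = prototype.split()
    return " ".join(synonyms.get(w, w) if z[i % len(z)] > 0 else w
                    for i, w in enumerate(words))


prototype = random.choice(corpus)        # step 1: sample a prototype sentence
z = sample_edit_vector()                 # step 2: sample a latent edit vector
print(prototype)
print(edit(prototype, z))                # step 3: edit the prototype into a new sentence
```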


2014 ◽  
Vol 2 ◽  
pp. 181-192 ◽  
Author(s):  
Dani Yogatama ◽  
Chong Wang ◽  
Bryan R. Routledge ◽  
Noah A. Smith ◽  
Eric P. Xing

We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.
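As a sketch of conditioning a language model on non-linguistic context and updating it online as data streams in, the snippet below fits a log-linear unigram model whose word scores depend on two numeric context features. The features, learning rate, and log-linear form are illustrative assumptions rather than the authors' model.

```python
import numpy as np

vocab = ["rally", "crash", "earnings", "the", "market"]
V, F = len(vocab), 2                    # vocabulary size, number of context features
W = np.zeros((F, V))                    # context-feature -> word-score weights
b = np.zeros(V)                         # static word scores
lr = 0.1


def probs(context):                     # context: (F,), e.g. [stock_return, volatility]
    scores = b + context @ W
    e = np.exp(scores - scores.max())
    return e / e.sum()


def online_update(context, word):
    """One SGD step on the negative log-likelihood of the observed word."""
    global W, b
    p = probs(context)
    grad = p.copy()
    grad[vocab.index(word)] -= 1.0      # d(-log p)/d(scores)
    b -= lr * grad
    W -= lr * np.outer(context, grad)


stream = [(np.array([-2.0, 1.5]), "crash"), (np.array([1.0, 0.2]), "rally")]
for ctx, w in stream:
    online_update(ctx, w)
print(probs(np.array([-2.0, 1.5])))     # "crash" now gets more probability in this context
```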


2016 ◽  
Vol 4 ◽  
pp. 477-490 ◽  
Author(s):  
Ehsan Shareghi ◽  
Matthias Petri ◽  
Gholamreza Haffari ◽  
Trevor Cohn

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes by up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, with similar runtimes for training and comparable runtimes for querying.
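The on-the-fly probability computation can be sketched with plain substring counts standing in for the compressed suffix tree, which is what makes such counts cheap to store and query in the paper. The stupid-backoff discounting below is an assumption made for brevity, not the paper's smoothing method.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
MAX_ORDER = 3
counts = Counter(tuple(corpus[i:i + n])
                 for n in range(1, MAX_ORDER + 1)
                 for i in range(len(corpus) - n + 1))


def score(word, history, alpha=0.4):
    """P(word | history) with stupid backoff over progressively shorter histories."""
    history = tuple(history[-(MAX_ORDER - 1):])
    penalty = 1.0
    while history:
        num, den = counts[history + (word,)], counts[history]
        if den and num:
            return penalty * num / den
        history = history[1:]            # back off to a shorter context
        penalty *= alpha
    return penalty * counts[(word,)] / len(corpus)


print(score("sat", ["the", "cat"]))      # 0.5: "the cat" continues as "sat" in 1 of 2 cases
print(score("mat", ["a", "purple"]))     # unseen context backs off to the penalized unigram
```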


2020 ◽  
Vol 34 (05) ◽  
pp. 8082-8090
Author(s):  
Tushar Khot ◽  
Peter Clark ◽  
Michal Guerquin ◽  
Peter Jansen ◽  
Ashish Sabharwal

Composing knowledge from multiple pieces of text is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging, as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved, as this model still lags behind human performance by 20%.
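The two-step retrieval idea can be illustrated with a toy word-overlap retriever: the first hop retrieves a fact that overlaps with the question, and the new words that fact introduces (absent from the question) drive the second hop. The tiny fact corpus and the overlap scorer are placeholder assumptions, not the QASC retriever.

```python
facts = [
    "differential heating of air produces wind",
    "wind is used for producing electricity by wind turbines",
    "the sun heats the surface of the earth",
]


def tokens(text):
    return set(text.lower().split())


def retrieve(query_words, corpus, k=1):
    """Rank facts by word overlap with the query and keep the top k."""
    return sorted(corpus, key=lambda f: -len(query_words & tokens(f)))[:k]


question = "What can differential heating of air be harnessed for?"
first_hop = retrieve(tokens(question), facts)[0]           # mentions "wind"
new_words = tokens(first_hop) - tokens(question)           # concepts not in the question
second_hop = retrieve(new_words, [f for f in facts if f != first_hop])[0]
print(first_hop)
print(second_hop)    # composing both facts supports answers about electricity
```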


2008 ◽  
Vol 41 (3) ◽  
pp. i-ii

In this issue's state-of-the-art article, Lucie Moussu and Enric Llurda discuss research on non-native English-speaking teachers of English, highlighting throughout the need for more considered social recognition of the native-speaker/non-native-speaker identity. After discussing the current legitimacy of such labels in the light of research, they argue for a more reasoned approach both to the definition and function of non-native-speaker teachers, in particular in light of recent research on World Englishes. Particular attention is paid to the perception of non-native and native-speaker teachers by students, and to the attitudes and beliefs of both these students and hiring staff regarding such teachers' perceived differences, strengths, and weaknesses. In the final part of the paper, the authors address past and present research methods used in studies and suggest areas, such as longitudinal and classroom-based studies, where future research might usefully add to the current state of knowledge. The article is accompanied by Amir Soheili-Mehr's review of four recent books.


2020 ◽  
Author(s):  
Roshan M Rao ◽  
Joshua Meier ◽  
Tom Sercu ◽  
Sergey Ovchinnikov ◽  
Alexander Rives

Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find that the highest-capacity models trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
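A minimal sketch of reading contacts off an attention map: symmetrize the map, apply the average product correction (APC) commonly used in unsupervised contact prediction, and rank residue pairs by the corrected score. The random matrix stands in for a real attention head, and the paper additionally learns a sparse combination over many heads.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 10                                   # sequence length
attn = rng.random((L, L))                # placeholder for one attention head


def apc(matrix):
    """Subtract the average product term to remove background coupling."""
    row = matrix.mean(axis=1, keepdims=True)
    col = matrix.mean(axis=0, keepdims=True)
    return matrix - row * col / matrix.mean()


sym = 0.5 * (attn + attn.T)              # contacts are symmetric
scores = apc(sym)
np.fill_diagonal(scores, -np.inf)        # ignore trivial i == i pairs
i, j = np.unravel_index(np.argmax(scores), scores.shape)
print(f"top predicted contact: residues {i} and {j}")
```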


2021 ◽  
Author(s):  
Konstantin Weissenow ◽  
Michael Heinzinger ◽  
Burkhard Rost

All state-of-the-art (SOTA) protein structure predictions rely on evolutionary information captured in multiple sequence alignments (MSAs), primarily on evolutionary couplings (co-evolution). Such information is not available for all proteins and is computationally expensive to generate. Prediction models based on Artificial Intelligence (AI) using only single sequences as input are easier and cheaper but perform so poorly that speed becomes irrelevant. Here, we describe the first competitive AI solution that exclusively inputs embeddings extracted from single sequences by the pre-trained transformer protein Language Model (pLM) ProtT5 into a relatively shallow (few free parameters) convolutional neural network (CNN) trained on inter-residue distances, i.e. protein structure in 2D. The major advance originated from processing the attention heads learned by ProtT5. Although these models at no point required any MSA, they matched the performance of methods relying on co-evolution. Although not reaching the very top, our lean approach came close at substantially lower costs, thereby speeding up development and each future prediction. By generating protein-specific rather than family-averaged predictions, these new solutions could distinguish between structural features that differentiate members of the same protein family with similar structure, which all other top methods predict alike.
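A schematic of the pipeline described above: per-residue embeddings (random tensors stand in for ProtT5 output here) are combined into pairwise features and passed through a relatively shallow 2D CNN that outputs an inter-residue distance map. Layer sizes and the concatenation-based pairwise featurization are assumptions for illustration only.

```python
import torch
import torch.nn as nn

L, D = 50, 64                                     # residues, embedding dimension
emb = torch.randn(L, D)                           # stand-in for per-residue pLM embeddings

# Pairwise features: concatenate the embeddings of residue i and residue j.
pair = torch.cat([emb.unsqueeze(1).expand(L, L, D),
                  emb.unsqueeze(0).expand(L, L, D)], dim=-1)   # (L, L, 2D)
pair = pair.permute(2, 0, 1).unsqueeze(0)                      # (1, 2D, L, L)

cnn = nn.Sequential(                              # "relatively shallow" CNN
    nn.Conv2d(2 * D, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),   # one value per residue pair
)
dist = cnn(pair).squeeze(0).squeeze(0)            # (L, L) predicted distance map
print(dist.shape)
```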


2021 ◽  
Author(s):  
Héléna Alexandra Gaspar ◽  
Mohamed Ahmed ◽  
Thomas Edlich ◽  
Benedek Fabian ◽  
Zsolt Varszegi ◽  
...  

Proteochemometric (PCM) models of protein-ligand activity combine information from both the ligands and the proteins to which they bind. Several methods inspired by the field of natural language processing (NLP) have been proposed to represent protein sequences.

Here, we present PCM benchmark results on three multi-protein datasets: protein kinases, rhodopsin-like GPCRs (ChEMBL binding and functional assays), and cytochrome P450 enzymes. Keeping ligand descriptors fixed, we evaluate our own protein embeddings based on subword-segmented language models trained on mammalian sequences against pre-existing NLP-based descriptors, protein-protein similarity matrices derived from multiple sequence alignments (MSA), dummy protein one-hot encodings, and a combination of NLP-based and MSA-based descriptors. Our results show that performance gains over one-hot encodings are small and that combining NLP-based and MSA-based descriptors increases predictive performance consistently across different splitting strategies. This work has been presented at the 3rd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry in September 2020.
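As a schematic of the proteochemometric setup, each training example below pairs a ligand descriptor with a protein descriptor, and a single model is fit on the concatenated features to predict activity. The random fingerprints, embeddings, and labels are placeholders; the benchmark above uses real ChEMBL data and the descriptors it lists.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_pairs, d_ligand, d_protein = 200, 128, 64

ligand_fp = rng.integers(0, 2, size=(n_pairs, d_ligand))     # e.g. fingerprint-like bits
protein_emb = rng.normal(size=(n_pairs, d_protein))          # e.g. language-model embedding
activity = rng.normal(size=n_pairs)                          # pIC50-like label

X = np.hstack([ligand_fp, protein_emb])                      # PCM feature matrix
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, activity)
print(model.predict(X[:3]))
```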


2021 ◽  
Vol 47 (1) ◽  
pp. 141-179
Author(s):  
Matej Martinc ◽  
Senja Pollak ◽  
Marko Robnik-Šikonja

Abstract We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages, and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labeled readability data sets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their strengths and weaknesses, compare their performance to current state-of-the-art classification approaches to readability, which in most cases still rely on extensive feature engineering, and propose possibilities for improvements.
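One simple way to use a language model for unsupervised readability scoring, standing in for the neural models above, is to treat documents whose words the model finds less predictable as harder. The smoothed unigram model and the per-word negative log-probability score below are simplifying assumptions, not the paper's readability measure.

```python
import math
from collections import Counter

reference = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(reference)
total = len(reference)


def logprob(word, alpha=1.0):
    """Add-one smoothed unigram log-probability."""
    return math.log((counts[word] + alpha) / (total + alpha * (len(counts) + 1)))


def readability_score(text):
    """Average negative log-probability per word: higher means harder to predict."""
    words = text.lower().split()
    return -sum(logprob(w) for w in words) / len(words)


print(readability_score("the cat sat on the mat"))                 # low: familiar words
print(readability_score("ontological exegesis of hermeneutics"))   # high: rare words
```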


Author(s):  
Sebastian J. Mielke ◽  
Jason Eisner

We show how the spellings of known words can help us deal with unknown words in open-vocabulary NLP tasks. The method we propose can be used to extend any closed-vocabulary generative model, but in this paper we specifically consider the case of neural language modeling. Our Bayesian generative story combines a standard RNN language model (generating the word tokens in each sentence) with an RNN-based spelling model (generating the letters in each word type). These two RNNs respectively capture sentence structure and word structure, and are kept separate as in linguistics. By invoking the second RNN to generate spellings for novel words in context, we obtain an open-vocabulary language model. For known words, embeddings are naturally inferred by combining evidence from type spelling and token context. Compared to baselines (including a novel strong baseline), we beat previous work and establish state-of-the-art results on multiple datasets.
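A count-based toy version of the two-level story above: known words are scored by a word-level model, while an unknown word falls back to a character-level spelling model, so the combined model never assigns zero probability to a novel word. Both components below are trivial stand-ins for the paper's two RNNs, and the fixed out-of-vocabulary mass is an assumption.

```python
import math
from collections import Counter

train_words = "the cat sat on the mat".split()
word_counts = Counter(train_words)
char_counts = Counter(c for w in train_words for c in w + "#")   # '#' ends a word
total_words, total_chars = len(train_words), sum(char_counts.values())


def log_p_spelling(word):
    """Character unigram spelling model: add-one smoothing over a rough 27-symbol alphabet."""
    return sum(math.log((char_counts[c] + 1) / (total_chars + 27))
               for c in word + "#")


def log_p_word(word, oov_mass=0.1):
    """Known words use word counts; unknown words fall back to the spelling model."""
    if word in word_counts:
        return math.log((1 - oov_mass) * word_counts[word] / total_words)
    return math.log(oov_mass) + log_p_spelling(word)


print(log_p_word("cat"))        # known word: scored by the word-level model
print(log_p_word("catnip"))     # novel word: scored letter by letter
```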

