Viterbi decoding for latent words language models using Gibbs sampling

Author(s): Ryo Masumura, Hirokazu Masataki, Takanobu Oba, Osamu Yoshioka, Satoshi Takahashi

2021
Author(s): Sean R. Johnson, Sarah Monaco, Kenneth Massie, Zaid Syed

Abstract
Recently developed language models (LMs) based on deep neural networks have demonstrated the ability to generate fluent natural language text. LMs pre-trained on protein sequences have shown state-of-the-art performance on a variety of downstream tasks. Protein LMs have also been used to generate novel protein sequences. In the present work we use Gibbs sampling of BERT-style LMs, pre-trained on protein sequences using the masked language modeling task, to generate novel protein sequences. We evaluate the quality of the generated sequences by comparing them to natural sequences from the same family. In particular, we focus on proteins from the chorismate mutase type II family, which has been used in previous work as an example target for protein generative models. We find that Gibbs sampling from BERT-style models pre-trained on millions to billions of protein sequences generates novel sequences that retain key features of related natural sequences. Further, we find that smaller models fine-tuned or trained from scratch on family-specific data can equal or surpass the generation quality of large pre-trained models on some metrics. The ability to generate novel, natural-like protein sequences could contribute to the development of improved protein therapeutics and protein catalysts for industrial chemical production.
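The procedure described in this abstract is Gibbs sampling from a masked LM: repeatedly mask one position, take the model's conditional distribution over residues at that position given the rest of the sequence, and resample it. Below is a minimal sketch of such a sampler using the Hugging Face transformers API; the checkpoint name, sweep count, temperature, and seed sequence are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Gibbs sampling from a BERT-style protein masked LM.
# Model name and sampling schedule are placeholders; adapt to your checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "Rostlab/prot_bert"  # assumption: any BERT-style protein masked LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def gibbs_sample(seed_seq: str, n_sweeps: int = 10, temperature: float = 1.0) -> str:
    """Resample one residue at a time from the masked-LM conditionals."""
    # ProtBERT-style tokenizers expect space-separated amino acids.
    enc = tokenizer(" ".join(seed_seq), return_tensors="pt")
    ids = enc["input_ids"].clone()
    # Interior positions only (skip the special tokens at both ends).
    positions = list(range(1, ids.shape[1] - 1))
    for _ in range(n_sweeps):
        for pos in positions:  # fixed sweep order; a random order also works
            masked = ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(input_ids=masked).logits[0, pos] / temperature
            # In practice one would also zero out special/non-residue tokens here.
            probs = torch.softmax(logits, dim=-1)
            ids[0, pos] = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode(ids[0, 1:-1]).replace(" ", "")

# Usage (hypothetical seed): start from a natural sequence and draw a variant.
# new_seq = gibbs_sample("MTSENPLLALREKISALDEKLLALLAERRELAVE", n_sweeps=20)
```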


2019
Author(s): Amanda Goodwin, Yaacov Petscher, Jamie Tock

Various models have highlighted the complexity of language. Building on foundational ideas regarding three key aspects of language, our study contributes to the literature by 1) exploring broader conceptions of morphology, vocabulary, and syntax, 2) operationalizing this theoretical model into a gamified, standardized, computer-adaptive assessment of language for fifth- to eighth-grade students entitled Monster, PI, and 3) uncovering further evidence regarding the relationship between language and standardized reading comprehension via this assessment. Multiple-group item response theory (IRT) analyses across grades show that morphology was best fit by a bifactor model with task-specific factors along with a global factor related to each skill. Vocabulary was best fit by a bifactor model that identifies performance overall and on specific words. Syntax, though, was best fit by a unidimensional model. Next, Monster, PI produced reliable scores, suggesting that language can be assessed efficiently and precisely for students via this model. Lastly, performance on Monster, PI explained more than 50% of the variance in standardized reading, suggesting that operationalizing language via Monster, PI can provide meaningful understandings of the relationship between language and reading comprehension. Specifically, considering just a subset of a construct, like identification of units of meaning, explained significantly less variance in reading comprehension. This highlights the importance of considering these broader constructs. Implications indicate that future work should consider a model of language where component areas are considered broadly and contributions to reading comprehension are explored via general performance on components as well as skill-level performance.
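For readers unfamiliar with the bifactor structure mentioned above, the sketch below gives a generic bifactor 2PL item response function, in which each item loads on one global factor and one task-specific factor; this is an illustrative form, and the study's exact parameterization (link function, factor grouping, estimation details) may differ.

```latex
% Generic bifactor 2PL item response function (illustrative sketch).
% Item j loads on a global factor and on exactly one task-specific factor s(j).
P\bigl(X_{ij}=1 \mid \theta_{iG}, \theta_{i s(j)}\bigr)
  = \frac{1}{1 + \exp\!\bigl(-\,(a_{jG}\,\theta_{iG} + a_{js}\,\theta_{i s(j)} + d_j)\bigr)}
```

Here \(\theta_{iG}\) is student i's global ability, \(\theta_{i s(j)}\) the task-specific factor for item j's task, \(a_{jG}\) and \(a_{js}\) the discriminations, and \(d_j\) the item intercept; the unidimensional model reported for syntax corresponds to dropping the specific term.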


1991
Author(s): Alan E. Gelfand, Adrian F. Smith

Author(s): Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, Dietrich Klakow

Author(s): Vitaly Kuznetsov, Hank Liao, Mehryar Mohri, Michael Riley, Brian Roark
