Semantic programming by example with pre-trained models

2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-25
Author(s):  
Gust Verbruggen ◽  
Vu Le ◽  
Sumit Gulwani

The ability to learn programs from a few examples is a powerful technology with disruptive applications in many domains, as it allows users to automate repetitive tasks in an intuitive way. Existing frameworks for inductive synthesis only perform syntactic manipulations: they rely on the syntactic structure of the given examples and not on their meaning. Any semantic manipulations, such as transforming dates, have to be manually encoded by the designer of the inductive programming framework. Recent advances in large language models have shown these models to be very adept at performing semantic transformations of their input when given just a few examples of the task at hand. When it comes to syntactic transformations, however, these models are limited in their expressive power. In this paper, we propose a novel framework for integrating inductive synthesis with few-shot learning language models to combine the strengths of these two popular technologies. In particular, the inductive synthesizer is tasked with breaking the problem down into smaller subproblems, and those that cannot be solved syntactically are passed to the language model. We formalize three semantic operators that can be integrated with inductive synthesizers. To minimize invocations of expensive semantic operators during learning, we introduce a novel deferred query execution algorithm that treats the operators as oracles during learning. We evaluate our approach in the domain of string transformations: the combined methodology can automate tasks that cannot be handled by either technology on its own. Finally, we demonstrate the generality of our approach via a case study in the domain of string profiling.
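To give a flavour of the idea, the following is a minimal, illustrative sketch (not the paper's actual operators or algorithm): output tokens that literally occur in the input are treated as syntactic copies, everything else is recorded as a deferred query and only sent to a semantic oracle, a stub standing in for a few-shot language model, after syntactic search finishes. The tokenization, the `fake_oracle` lookup and all names are assumptions made for the example.

```python
# Toy programming-by-example loop: explain each output token syntactically
# (copied from the input) or defer it to a semantic oracle (an LLM stand-in).
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, output)

def explain_syntactically(inp: str, out_token: str) -> bool:
    """A token counts as 'syntactic' if it literally appears in the input."""
    return out_token in inp

def synthesize(examples: List[Example], oracle: Callable[[str, str], str]):
    inp, out = examples[0]          # toy: only look at the first example
    plan, deferred = [], []
    for token in out.split():
        if explain_syntactically(inp, token):
            plan.append(("copy", token))
        else:
            plan.append(("semantic", token))
            deferred.append((inp, token))
    # Deferred execution: the expensive oracle is queried only now, in a batch.
    answers = {q: oracle(*q) for q in deferred}
    return plan, answers

if __name__ == "__main__":
    def fake_oracle(inp: str, token: str) -> str:
        # Stand-in for a few-shot LLM query, e.g. mapping the month "01" to "January".
        return {"01": "January"}.get(inp.split("/")[1], token)

    plan, answers = synthesize([("14/01/2021", "14 January 2021")], fake_oracle)
    print(plan)     # [('copy', '14'), ('semantic', 'January'), ('copy', '2021')]
    print(answers)  # {('14/01/2021', 'January'): 'January'}
```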

2019 ◽  
Vol 9 (18) ◽  
pp. 3648
Author(s):  
Casper S. Shikali ◽  
Zhou Sijie ◽  
Liu Qihe ◽  
Refuoe Mokhosi

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, a low-resource yet widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the problem of word representation for agglutinative and syllabic-based languages. Inspired by the way Swahili is taught in beginner classes, we encoded the syllables of words instead of characters, character n-grams or morphemes, and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-the-art results of the syllable-aware language model on both the small dataset (perplexity of 31.229) and the medium dataset (perplexity of 45.859), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not previously been used to compose word representation vectors. The main contributions of the study are therefore a syllabic alphabet, WEFSE, a syllable-aware language model and a word analogy dataset for Swahili.
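As a rough illustration of the syllable-based composition idea (not the paper's model), the sketch below splits words with a simplistic (C)(C)V regex and mean-pools randomly initialised syllable vectors; the real approach builds the syllable inventory from a Swahili syllabic alphabet and composes syllable embeddings with a convolutional network. The regex, the pooling and the dimensions are assumptions.

```python
# Minimal sketch: word vectors composed from syllable vectors.
import re
import numpy as np

DIM = 64
rng = np.random.default_rng(0)
syllable_vectors = {}  # lazily created lookup table of syllable embeddings

def syllabify(word: str):
    # Rough (C)(C)V splitter; real Swahili syllabification also handles
    # syllabic nasals such as the initial "m" in "mtoto".
    return re.findall(r"[^aeiou]*[aeiou]|[^aeiou]+$", word.lower())

def syllable_vector(syl: str) -> np.ndarray:
    if syl not in syllable_vectors:
        syllable_vectors[syl] = rng.normal(size=DIM)
    return syllable_vectors[syl]

def word_embedding(word: str) -> np.ndarray:
    # Mean-pool syllable vectors; a CNN over the syllable sequence
    # would replace this step in the actual model.
    syls = syllabify(word) or [word]
    return np.mean([syllable_vector(s) for s in syls], axis=0)

print(syllabify("watoto"))             # ['wa', 'to', 'to']
print(word_embedding("watoto").shape)  # (64,)
```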


Author(s):  
Bali Ranaivo-Malancon

Identifying the language of an unknown text is not a new problem, but what is new is the task of identifying closely related languages. Malay and Indonesian, like many other language pairs, are very similar, and it is therefore genuinely difficult to search, retrieve, classify and, above all, translate texts written in either of the two languages. We have built a language identifier to determine whether a text is written in Malay or Indonesian, and it could be used in any similar situation. It uses the frequency and rank of character trigrams, lists of exclusive words, and the format of numbers. The trigrams are derived from the most frequent words in each language. The current program contains the following language models: Malay/Indonesian (661 trigrams), Dutch (826 trigrams), English (652 trigrams), French (579 trigrams), and German (482 trigrams). The trigrams of an unknown text are searched for in each language model. The language of the input text is the language with the highest ratios of "number of shared trigrams / total number of trigrams" and "number of winner trigrams / number of shared trigrams". If the language found at the trigram search level is 'Malay or Indonesian', the text is then scanned for the format of numbers and for some exclusive words.
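A minimal sketch of the trigram-matching step is shown below. The tiny profiles are built from a handful of frequent words and stand in for the 480-830 trigram models mentioned above; the sketch also omits the second "winner trigrams" ratio and the exclusive-word and number-format checks. All word lists and names are assumptions for illustration.

```python
# Toy character-trigram language identifier.
from collections import Counter

def trigrams(text: str) -> Counter:
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Placeholder language models built from a few frequent words per language.
PROFILES = {
    "malay_indonesian": trigrams("yang dan di ini itu dengan untuk tidak"),
    "english": trigrams("the and of to in that is for it with"),
    "french": trigrams("le la de et les des un une que pour"),
}

def identify(text: str) -> str:
    observed = trigrams(text)
    scores = {}
    for lang, profile in PROFILES.items():
        shared = sum((observed & profile).values())   # shared trigram mass
        scores[lang] = shared / max(sum(observed.values()), 1)
    return max(scores, key=scores.get)

print(identify("buku ini untuk anda dan saya"))  # likely: malay_indonesian
print(identify("the book is for you and me"))    # likely: english
```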


2020 ◽  
Author(s):  
M. Alex Kelly ◽  
Yang Xu ◽  
Jesús Calvillo ◽  
David Reitter

Recent models of language have eliminated syntactic-semantic dividing lines. We explore the psycholinguistic implications of this development by comparing different types of sentence embeddings in their ability to encode syntactic constructions. Our study uses contrasting sentence structures known to cause syntactic priming effects, that is, the tendency in humans to repeat sentence structures after recent exposure. We compare how syntactic alternatives are captured by sentence embeddings produced by a neural language model (BERT) or by the composition of word embeddings (BEAGLE, HHM, GloVe). Dative double object vs. prepositional object and active vs. passive sentences are separable in the high-dimensional space of the sentence embeddings and can be classified with a high degree of accuracy. The results lend empirical support to the modern, computational, integrated accounts of semantics and syntax, and they shed light on the information stored at different layers in deep language models such as BERT.
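The probing setup can be sketched as a simple linear classifier over sentence embeddings, as below. The random vectors stand in for real BERT or composed word embeddings, so the printed accuracy is meaningless here; only the pipeline shape is illustrated, and the dimensions and labels are assumptions.

```python
# Sketch of a probing classifier separating two syntactic alternatives.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 200, 768
X = rng.normal(size=(n, dim))       # placeholder sentence embeddings
y = rng.integers(0, 2, size=n)      # 0 = active, 1 = passive (placeholder labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", clf.score(X_te, y_te))  # ~chance for random data
```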


Author(s):  
A. Evtushenko

Machine learning language models are combinations of algorithms and neural networks designed for processing text written in natural language (Natural Language Processing, NLP). In 2020, OpenAI, an artificial intelligence research company, released its largest language model, GPT-3, whose parameter count reaches 175 billion. Increasing the model's parameterization by more than 100 times made it possible to improve the quality of generated texts to a level that is hard to distinguish from human-written text. Notably, the model was trained on a dataset collected mainly from open sources on the Internet, whose volume is estimated at 570 GB. This article discusses the problem of memorizing critical information, in particular personal data of individuals, during the training of large language models (GPT-2/3 and derivatives). It also describes an algorithmic approach to this problem, which consists of additional preprocessing of the training dataset and refinement of model inference so that pseudo-personal data is generated and embedded into the results of summarization, text generation, question answering and other seq2seq tasks.
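The kind of dataset preprocessing described above might look, in its simplest form, like the sketch below: obvious personal identifiers are replaced by placeholder tags before training. The regexes and tags are illustrative assumptions, not the article's method; a production pipeline would rely on a proper PII/NER detector and also adjust generation at inference time.

```python
# Minimal sketch of PII scrubbing for a training corpus.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def scrub(text: str) -> str:
    # Replace each detected span with a typed placeholder tag.
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{tag}>", text)
    return text

print(scrub("Contact Ivan at ivan.petrov@example.com or +7 495 123-45-67."))
# Contact Ivan at <EMAIL> or <PHONE>.
```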


2021 ◽  
Author(s):  
Andrew E Blanchard ◽  
John Gounley ◽  
Debsindhu Bhowmik ◽  
Mayanka Chandra Shekar ◽  
Isaac Lyngaas ◽  
...  

The COVID-19 pandemic highlights the need for computational tools to automate and accelerate drug design for novel protein targets. We leverage deep learning language models to generate and score drug candidates based on predicted protein binding affinity. We pre-trained a deep learning language model (BERT) on ~9.6 billion molecules and achieved peak performance of 603 petaflops in mixed precision. Our work reduces pre-training time from days to hours, compared to previous efforts with this architecture, while also increasing the dataset size by nearly an order of magnitude. For scoring, we fine-tuned the language model using an assembled set of thousands of protein targets with binding affinity data and searched for inhibitors of specific protein targets, SARS-CoV-2 Mpro and PLpro. We utilized a genetic algorithm approach for finding optimal candidates using the generation and scoring capabilities of the language model. Our generalizable models accelerate the identification of inhibitors for emerging therapeutic targets.
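The generate-and-score loop can be pictured as a simple genetic algorithm like the sketch below. The scoring function is a stub standing in for the fine-tuned affinity model, the "molecules" are plain strings with no chemistry validity checks, and all names are assumptions made for illustration.

```python
# Toy genetic-algorithm loop: mutate candidate strings, keep the top scorers.
import random

random.seed(0)
ALPHABET = "CNOcn=()1"

def score(smiles: str) -> float:
    # Stub for the learned scoring model: reward length and ring closures.
    return len(smiles) * 0.1 + smiles.count("1")

def mutate(smiles: str) -> str:
    i = random.randrange(len(smiles))
    return smiles[:i] + random.choice(ALPHABET) + smiles[i:]  # insert a character

def evolve(seed_pop, generations=20, pop_size=30):
    pop = list(seed_pop)
    for _ in range(generations):
        children = [mutate(random.choice(pop)) for _ in range(pop_size)]
        pop = sorted(set(pop + children), key=score, reverse=True)[:pop_size]
    return pop

best = evolve(["CCO", "c1ccccc1"])
print(best[:3], [round(score(s), 2) for s in best[:3]])
```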


Author(s):  
Guirong Bai ◽  
Shizhu He ◽  
Kang Liu ◽  
Jun Zhao

Active learning is an effective method for substantially alleviating the expensive annotation cost of data-driven models. Recently, pre-trained language models have been shown to be powerful for learning language representations. In this article, we demonstrate that a pre-trained language model can also use its learned textual characteristics to enrich the criteria of active learning. Specifically, we provide extra textual criteria based on the pre-trained language model to measure instances, including noise, coverage, and diversity. With these extra textual criteria, we can select more informative instances for annotation and obtain better results. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that the proposed active learning approach is enhanced by the pre-trained language model and obtains better performance.
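One of the extra criteria, diversity, can be illustrated with the sketch below: embeddings from a pre-trained model are used to greedily pick a batch of mutually distant instances for annotation. The random vectors stand in for real language-model embeddings, and farthest-point selection is just one simple way to encode diversity, not necessarily the authors' formulation.

```python
# Sketch: greedy farthest-point selection over (placeholder) LM embeddings.
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int):
    chosen = [0]  # start from an arbitrary instance
    for _ in range(k - 1):
        # Distance from every instance to its nearest already-chosen instance.
        dists = np.min(
            np.linalg.norm(embeddings[:, None] - embeddings[chosen][None], axis=-1),
            axis=1,
        )
        dists[chosen] = -np.inf          # never re-pick a chosen instance
        chosen.append(int(np.argmax(dists)))
    return chosen

rng = np.random.default_rng(0)
pool = rng.normal(size=(100, 32))        # placeholder embeddings of unlabeled data
print(select_diverse(pool, k=5))         # indices to send for annotation
```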


Author(s):  
Lipeng Zhang ◽  
Peng Zhang ◽  
Xindian Ma ◽  
Shuqin Gu ◽  
Zhan Su ◽  
...  

In the literature, tensors have been used effectively to capture context information in language models. However, existing methods usually adopt relatively low-order tensors, which have limited expressive power for modeling language. Developing a higher-order tensor representation is challenging, both in deriving an effective solution and in showing its generality. In this paper, we propose a language model named the Tensor Space Language Model (TSLM), built on tensor networks and tensor decomposition. In TSLM, we construct a high-dimensional semantic space from the tensor product of word vectors. Theoretically, we prove that this tensor representation is a generalization of the n-gram language model. We further show that this high-order tensor representation can be decomposed into a recursive calculation of conditional probabilities for language modeling. Experimental results on the Penn Treebank (PTB) dataset and the WikiText benchmark demonstrate the effectiveness of TSLM.
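The n-gram connection can be illustrated numerically, as in the sketch below: an order-3 tensor over the vocabulary plays the role of a trigram table, built here from random CP-style factors, and conditional probabilities are read off by contracting it with (one-hot) word indices. The rank, vocabulary size and factors are placeholder assumptions, not the trained TSLM.

```python
# Numerical sketch: a low-rank order-3 tensor as a trigram-style language model.
import numpy as np

rng = np.random.default_rng(0)
V, R = 5, 3                               # vocabulary size, tensor rank

# CP-style factors: T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
A, B, C = (np.abs(rng.normal(size=(V, R))) for _ in range(3))
T = np.einsum("ir,jr,kr->ijk", A, B, C)   # (unnormalised) trigram tensor

def cond_prob(w1: int, w2: int, w3: int) -> float:
    """P(w3 | w1, w2) read from the tensor, like a normalised trigram count."""
    slice_ = T[w1, w2]                    # contraction with one-hot word vectors
    return float(slice_[w3] / slice_.sum())

print(round(cond_prob(0, 1, 2), 4))
```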


Sensors ◽  
2020 ◽  
Vol 21 (1) ◽  
pp. 133
Author(s):  
Marco Pota ◽  
Mirko Ventura ◽  
Rosario Catelli ◽  
Massimo Esposito

Over the last decade, industrial and academic communities have increased their focus on sentiment analysis techniques, especially as applied to tweets. State-of-the-art results have recently been achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle the Twitter jargon. This work introduces a different, two-step approach to Twitter sentiment analysis. First, the tweet jargon, including emojis and emoticons, is transformed into plain text, using procedures that are language-independent or easily applicable to different languages. Second, the resulting tweets are classified with the language model BERT, but pre-trained on plain text rather than on tweets, for two reasons: (1) pre-trained models on plain text are readily available in many languages, avoiding resource- and time-consuming model training on tweets from scratch; (2) available plain-text corpora are larger than tweet-only ones, thereby allowing better performance. A case study describing the application of the approach to Italian is presented, along with a comparison with existing Italian solutions. The results show the effectiveness of the approach and indicate that, thanks to its methodologically general basis, it is also promising for other languages.
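The first step, normalising tweet jargon into plain text, might look like the sketch below. The tiny emoji and emoticon tables, the English glosses and the mention/URL handling are illustrative assumptions; the actual pipeline uses fuller, possibly language-specific dictionaries.

```python
# Sketch: turn tweet jargon into plain text before a plain-text BERT classifier.
import re

EMOJI_MAP = {"😂": " laughing ", "❤️": " love ", "😡": " angry "}
EMOTICON_MAP = {":)": " smile ", ":(": " sad ", ":D": " laugh "}

def normalize_tweet(text: str) -> str:
    for emo, gloss in {**EMOJI_MAP, **EMOTICON_MAP}.items():
        text = text.replace(emo, gloss)
    text = re.sub(r"@\w+", "@user", text)        # anonymise mentions
    text = re.sub(r"https?://\S+", "", text)     # drop URLs
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tweet("@maria this film is great 😂 :) https://t.co/xyz"))
# "@user this film is great laughing smile"
```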


Author(s):  
Lely Halimah ◽  
Didin Syahrudin ◽  
Yunus Abidin

Mistakes are still found in language teaching and learning at schools today. Because of this, the process of teaching and learning language and literature at schools must change to effective models of language teaching and learning that can improve students' literacy. One such model is the Whole Language Model. The research questions are: (1) what is the real condition of the language teaching and learning process in schools? (2) how can a whole language model that improves students' literacy be developed? (3) does using the whole language model improve students' literacy? The method of this research is research and development. The subjects are students of the Primary School Laboratory of UPI Cibiru Campus. Measurement uses tests and observation, and the data are analysed with statistical techniques. The results conclude that (1) teachers need a good model of language teaching and learning to improve students' literacy in school; (2) the whole language model can be implemented in three steps: an introduction step, a process step, and a closing step; (3) implementing the whole language model in language teaching and learning can improve students' literacy in primary school.

Keywords: Whole Language, Elementary Student, Literacy


2002 ◽  
Vol 19 (4) ◽  
pp. 23-41
Author(s):  
Safoi Babana-Hampton

The essay examines the texts of the two women writers - Leila Abouzeid (from Morocco) and Nawal El Saadawi (from Egypt) - as offering two female perspectives within what is commonly referred to as "feminine" writing in the Arab Muslim world. My main interest is to explore the various discursive articulations of female identity that are challenged or foregrounded as a positive model. The essay points to the serious pitfalls of some feminist narratives in Arab-Muslim societies by dealing with a related problem: the author's setting up of convenient conceptual dichotomies, which account for the female experience, that reduce male-female relationships in the given social context to a fundamentally antagonistic one. Abouzeid's novel will be a case study of a more positive but also realistic and complex perspective on female experience ...

