Astrid

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries and then estimate selectivity using statistical correlations. However, this produces insufficiently accurate cardinality estimates, resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing, such as embeddings, to encode the strings and use these encodings to train a model. While this is an improvement over traditional approaches, considerable room for further gains remains. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to that of 'abc' than to that of 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
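To make the estimated quantity concrete, the following minimal Python sketch computes exact prefix, suffix, and substring frequencies over a toy list of strings. It illustrates only the ground truth that an estimator approximates, not Astrid's embedding or language-model method. The names `pattern_frequencies` and `selectivity` and the sample data are assumptions made for this example.

```python
from collections import Counter


def pattern_frequencies(strings, kind):
    """Count how many strings match each pattern as a prefix, suffix, or substring."""
    counts = Counter()
    for s in strings:
        if kind == "prefix":
            patterns = {s[:i] for i in range(1, len(s) + 1)}
        elif kind == "suffix":
            patterns = {s[i:] for i in range(len(s))}
        elif kind == "substring":
            patterns = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        else:
            raise ValueError(f"unknown query type: {kind}")
        # A set is used so each string contributes at most once per pattern.
        counts.update(patterns)
    return counts


def selectivity(counts, pattern, total):
    """Fraction of the strings that match the pattern."""
    return counts[pattern] / total


# Hypothetical toy data set.
data = ["abc", "abd", "abc", "ab", "xab"]
prefix = pattern_frequencies(data, "prefix")
suffix = pattern_frequencies(data, "suffix")
substring = pattern_frequencies(data, "substring")
```

Exact counting like this grows quickly with the number and length of strings, which is one reason summary structures and learned estimators are used instead.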


Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches

AI ◽

10.3390/ai2010001 ◽

2021 ◽

Vol 2 (1) ◽

pp. 1-16

Author(s):

Juan Cruz-Benito ◽

Sanjay Vishwakarma ◽

Francisco Martin-Fernandez ◽

Ismael Faro

Keyword(s):

Deep Learning ◽

Learning Community ◽

Programming Languages ◽

Language Processing ◽

Code Generation ◽

Language Model ◽

Language Models ◽

Stochastic Gradient Descent ◽

Network Architectures ◽

Learning Architectures

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.


KM-BERT: A Pre-trained BERT for Korean Medical Natural Language Processing (Preprint)

10.2196/preprints.31223 ◽

2021 ◽

Author(s):

Yoojoong Kim ◽

Jeong Moon Lee ◽

Moon Joung Jang ◽

Yun Jin Yum ◽

Jong-Ho Kim ◽

...

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Pearson Correlation ◽

Language Model ◽

Language Models ◽

Korean Language ◽

Medical Texts ◽

Proposed Model

BACKGROUND With advances in deep learning and natural language processing, analyzing medical texts is becoming increasingly important. Nonetheless, despite the importance of medical texts, no study on medical-specific language models has yet been conducted. OBJECTIVE Korean medical text is highly difficult to analyze because of the agglutinative characteristics of the language as well as the complex terminologies in the medical domain. To solve this problem, we collected a Korean medical corpus and used it to train language models. METHODS In this paper, we present a Korean medical language model based on deep learning natural language processing. The proposed model was trained using the pre-training framework of BERT for the medical context based on a state-of-the-art Korean language model. RESULTS After pre-training, the proposed method showed increased accuracies of 0.147 and 0.148 for the masked language model with next sentence prediction. In the intrinsic evaluation, the next sentence prediction accuracy improved by 0.258, which is a remarkable enhancement. In addition, the extrinsic evaluation of Korean medical semantic textual similarity data showed a 0.046 increase in the Pearson correlation. CONCLUSIONS The results demonstrated the superiority of the proposed model for Korean medical natural language processing. We expect that our proposed model can be extended for application to various languages and domains.


Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, related resources (both datasets and technology) are not yet well developed. Moreover, affixes are an important part of word composition in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. Targeting the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.


Senti-BAS: A BERT-based model with sentiment computing for happiness research (Preprint)

10.2196/preprints.27914 ◽

2021 ◽

Author(s):

Zeyuan Zeng ◽

Yijia Zhang ◽

Liang Yang ◽

Hongfei Lin

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Language Processing ◽

High Accuracy ◽

Language Models ◽

Fine Grained ◽

Label Information ◽

Common Criterion ◽

Text Content ◽

Sentiment Computing

BACKGROUND Happiness has recently become a topic of growing interest. It can be described in various forms. For text content, studying happiness with natural language processing (NLP) methods is an interesting subject. OBJECTIVE Because happiness is an abstract and complicated emotion, there is no common criterion to measure and describe it. Therefore, researchers are creating different models to study and measure happiness. METHODS In this paper, we present a deep-learning based model called Senti-BAS (BERT embedded Bi-LSTM with self-Attention mechanism along with the Sentiment computing). RESULTS Given a sentence that describes a recent moment of happiness, the model classifies the happiness scenario along two dimensions: whether it was controlled by the author (label ‘agency’) and whether it involved other people (label ‘social’). Besides language models, we employ label information through lexicon-based sentiment computing. CONCLUSIONS The model achieves high accuracy on both the ‘agency’ and ‘social’ labels, and we also compare it with several popular embedding models such as ELMo and GPT. Building on our work, happiness can be studied at a more fine-grained level.


Self-Supervised Contextual Data Augmentation for Natural Language Processing

Symmetry ◽

10.3390/sym11111393 ◽

2019 ◽

Vol 11 (11) ◽

pp. 1393

Author(s):

Dongju Park ◽

Chang Wook Ahn

Keyword(s):

Supervised Learning ◽

Language Processing ◽

Recurrent Neural Networks ◽

Question Answering ◽

Data Augmentation ◽

Language Model ◽

Contextual Data ◽

External Data ◽

Label Information ◽

Benchmark Datasets

In this paper, we propose a novel data augmentation method with respect to the target context of the data via self-supervised learning. Instead of looking for the exact synonyms of masked words, the proposed method finds words that can replace the original words considering the context. For self-supervised learning, we can employ the masked language model (MLM), which masks a specific word within a sentence and recovers the original word. The MLM learns the context of a sentence through asymmetrical inputs and outputs. However, rather than using the existing MLM as is, we propose a label-masked language model (LMLM) that includes label information in the mask tokens, so that the MLM can be used effectively on data with label information. The augmentation method performs self-supervised learning using LMLM and then implements data augmentation through the trained model. Through several experiments, we demonstrate that our proposed method improves the classification accuracy of recurrent neural network and convolutional neural network-based classifiers on text classification benchmark datasets, including the Stanford Sentiment Treebank-5 (SST5), the Stanford Sentiment Treebank-2 (SST2), the subjectivity (Subj), the Multi-Perspective Question Answering (MPQA), the Movie Reviews (MR), and the Text Retrieval Conference (TREC) datasets. In addition, since the proposed method does not use external data, it can eliminate the time spent collecting external data or pre-training on external data.
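The masking step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact form of a label-carrying mask token, so the `[MASK:label]` format and the function name `mask_tokens` are hypothetical.

```python
def mask_tokens(tokens, positions, label=None):
    """Replace the tokens at the given positions with a mask token.

    With label=None this mimics plain MLM masking. With a label, the mask
    token carries the class label (hypothetical "[MASK:<label>]" format).
    Returns the masked sequence and a dict of position -> original token,
    which a model would be trained to recover.
    """
    mask = "[MASK]" if label is None else f"[MASK:{label}]"
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = mask
    return masked, targets


# Hypothetical labeled training sentence.
sentence = ["a", "great", "movie"]
plain_masked, plain_targets = mask_tokens(sentence, [1])
label_masked, label_targets = mask_tokens(sentence, [1], label="positive")
```

For augmentation, a trained model would then propose replacement words for the masked position that fit both the context and the label.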


Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Applied Sciences ◽

10.3390/app9183648 ◽

2019 ◽

Vol 9 (18) ◽

pp. 3648

Author(s):

Casper S. Shikali ◽

Zhou Sijie ◽

Liu Qihe ◽

Refuoe Mokhosi

Keyword(s):

Language Processing ◽

Critical Role ◽

Language Model ◽

Central Africa ◽

Spoken Language ◽

Language Models ◽

Word Embeddings ◽

Word Representation

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low-resource and widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-the-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.


COVID-19 sentiment analysis via deep learning during the rise of novel cases

PLoS ONE ◽

10.1371/journal.pone.0255615 ◽

2021 ◽

Vol 16 (8) ◽

pp. e0255615

Author(s):

Rohitash Chandra ◽

Aswin Krishna

Keyword(s):

Deep Learning ◽

Sentiment Analysis ◽

Short Term Memory ◽

Language Model ◽

Deep Understanding ◽

Language Models ◽

Catastrophic Events ◽

Social Scientists ◽

Significant Group ◽

Global Vector

Social scientists and psychologists take interest in understanding how people express emotions and sentiments when dealing with catastrophic events such as natural disasters, political unrest, and terrorism. The COVID-19 pandemic is a catastrophic event that has raised a number of psychological issues such as depression, given abrupt social changes and lack of employment. Advancements in deep learning-based language models have been promising for sentiment analysis with data from social networks such as Twitter. During the COVID-19 pandemic, different countries had different peaks, where the rise and fall of new cases affected lock-downs, which directly affected the economy and employment. During the rise of COVID-19 cases with stricter lock-downs, people have been expressing their sentiments on social media. This can provide a deep understanding of human psychology during catastrophic events. In this paper, we present a framework that employs deep learning-based language models via long short-term memory (LSTM) recurrent neural networks for sentiment analysis during the rise of novel COVID-19 cases in India. The framework features an LSTM language model with a global vector embedding and a state-of-the-art BERT language model. We review the sentiments expressed for selected months in 2020, which cover the major peak of novel cases in India. Our framework utilises multi-label sentiment classification, where more than one sentiment can be expressed at once. Our results indicate that the majority of the tweets have been positive, with high levels of optimism during the rise of the novel COVID-19 cases, and that the number of tweets significantly lowered towards the peak. We find that optimistic, annoyed and joking tweets mostly dominate the monthly tweets, with a much lower portion of negative sentiments. The predictions generally indicate that although the majority have been optimistic, a significant group of the population has been annoyed with the way the pandemic was handled by the authorities.
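Multi-label classification differs from single-label classification in that each label is decided independently. The following minimal Python sketch shows one common decision rule, an independent sigmoid per label with a fixed threshold. The abstract does not state the decision rule used, so the threshold of 0.5, the function names, and the example scores are assumptions.

```python
import math


def sigmoid(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, labels, threshold=0.5):
    """Return every label whose independent probability reaches the threshold.

    Unlike a softmax over classes, several labels (or none) can be selected.
    """
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]


# Hypothetical per-label scores for one tweet.
sentiments = ["optimistic", "annoyed", "joking"]
scores = [2.0, -1.0, 0.3]
predicted = predict_labels(scores, sentiments)
```

Here a single tweet receives two labels at once, which a single-label classifier could not express.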


Text: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning

10.31234/osf.io/293kt ◽

2021 ◽

Author(s):

Oscar Nils Erik Kjell ◽

H. Andrew Schwartz ◽

Salvatore Giorgi

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Rating Scale ◽

State Of The Art ◽

R Package ◽

Language Models ◽

Categorical Variables ◽

Human Language

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. Text is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as providing a framework for advanced users to build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed: to transform text to traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain: to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest: to compute semantic similarity scores between texts and to test the significance of the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot: to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating scale scores).
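A semantic similarity score between two texts is commonly computed as the cosine of the angle between their embedding vectors. The sketch below shows that calculation in plain Python as a language-neutral illustration. It is not the R package's code, and treating cosine as the measure is an assumption, as are the function name `cosine_similarity` and the toy vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Hypothetical two-dimensional "embeddings" of three texts.
text_a = [1.0, 0.0]
text_b = [2.0, 0.0]   # same direction as text_a, different length
text_c = [0.0, 3.0]   # orthogonal to text_a
same_direction = cosine_similarity(text_a, text_b)
orthogonal = cosine_similarity(text_a, text_c)
```

Because the score depends only on direction, texts whose embeddings differ in magnitude but not in direction are treated as maximally similar.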


Comparing gated and simple recurrent neural network architectures as models of human sentence processing

10.31234/osf.io/wec74 ◽

2018 ◽

Author(s):

Christoph Aurnhammer ◽

Stefan L. Frank

Keyword(s):

Language Processing ◽

Sentence Processing ◽

Language Model ◽

Cell Types ◽

Recurrent Network ◽

Cognitive Models ◽

Language Models ◽

Model Quality ◽

Sentence Reading ◽

Human Sentence Processing

The Simple Recurrent Network (SRN) has a long tradition in cognitive models of language processing. More recently, gated recurrent networks have been proposed that often outperform the SRN on natural language processing tasks. Here, we investigate whether two types of gated networks perform better as cognitive models of sentence reading than SRNs, beyond their advantage as language models. This will reveal whether the filtering mechanism implemented in gated networks corresponds to an aspect of human sentence processing. We train a series of language models differing only in the cell types of their recurrent layers. We then compute word surprisal values for stimuli used in self-paced reading, eye-tracking, and electroencephalography experiments, and quantify the surprisal values' fit to experimental measures that indicate human sentence reading effort. While the gated networks provide better language models, they do not outperform their SRN counterpart as cognitive models when language model quality is equal across network types. Our results suggest that the different architectures are equally valid as models of human sentence processing.
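Word surprisal is the negative logarithm of the probability that a language model assigns to a word given the words before it. The minimal Python sketch below computes it in bits. The probabilities are hypothetical values standing in for a trained model's output, and the function name `surprisal` is chosen for this example.

```python
import math


def surprisal(probability):
    """Surprisal in bits: -log2 of the word's conditional probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical next-word probabilities a language model might assign
# to each word of a sentence, given the preceding words.
word_probs = [("the", 0.5), ("dog", 0.25), ("barked", 0.125)]
surprisals = [(word, surprisal(p)) for word, p in word_probs]
```

Less predictable words receive higher surprisal, and it is these per-word values that are compared against reading-time or EEG measures.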


Deep indexed active learning for matching heterogeneous entity representations

Proceedings of the VLDB Endowment ◽

10.14778/3485450.3485455 ◽

2021 ◽

Vol 15 (1) ◽

pp. 31-45

Author(s):

Arjit Jain ◽

Sunita Sarawagi ◽

Prithviraj Sen

Keyword(s):

Active Learning ◽

Committee Member ◽

Language Model ◽

Cartesian Product ◽

Rule Learning ◽

Search Space ◽

Training Data ◽

Language Models ◽

Passive Learning ◽

Benchmark Datasets

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real-world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low-resource settings. However, the search space for finding informative samples for the user to label grows quadratically for instance-pair tasks, making active learning hard to scale. Previous works in this setting rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss important regions in the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time.
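The effect of blocking on the candidate space can be seen in a small Python sketch. It uses a simple hand-crafted predicate (two records must share at least one token), which is the kind of baseline the abstract contrasts with, not DIAL's learned blocker. The function names and the sample records are assumptions for illustration.

```python
from itertools import product


def cartesian_pairs(left, right):
    """All candidate pairs: grows as len(left) * len(right)."""
    return list(product(left, right))


def token_blocking(left, right):
    """Keep only pairs that share at least one lowercased token."""
    # Inverted index from token to the right-hand records containing it.
    index = {}
    for record in right:
        for token in set(record.lower().split()):
            index.setdefault(token, set()).add(record)
    pairs = set()
    for record in left:
        for token in set(record.lower().split()):
            for candidate in index.get(token, ()):
                pairs.add((record, candidate))
    return pairs


# Hypothetical product records from two sources.
left = ["Apple iPhone 12", "Dell XPS 13"]
right = ["iphone 12 apple", "XPS 13 laptop dell", "Sony TV"]
all_pairs = cartesian_pairs(left, right)
blocked = token_blocking(left, right)
```

A predicate like this prunes most pairs cheaply, but it would miss true matches that share no token (for example, a translated or abbreviated name), which is the recall problem the abstract describes.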
