scholarly journals Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

2019 ◽  
Vol 9 (18) ◽  
pp. 3648
Author(s):  
Casper S. Shikali ◽  
Zhou Sijie ◽  
Liu Qihe ◽  
Refuoe Mokhosi

Deep learning has extensively been used in natural language processing with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low resource and widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.

Author(s):  
Casper Shikali Shivachi ◽  
Refuoe Mokhosi ◽  
Zhou Shijie ◽  
Liu Qihe

The need to capture intra-word information in natural language processing (NLP) tasks has inspired research in learning various word representations at word, character, or morpheme levels, but little attention has been given to syllables from a syllabic alphabet. Motivated by the success of compositional models in morphological languages, we present a Convolutional-long short term memory (Conv-LSTM) model for constructing Swahili word representation vectors from syllables. The unified architecture addresses the word agglutination and polysemous nature of Swahili by extracting high-level syllable features using a convolutional neural network (CNN) and then composes quality word embeddings with a long short term memory (LSTM). The word embeddings are then validated using a syllable-aware language model ( 31.267 ) and a part-of-speech (POS) tagging task ( 98.78 ), both yielding very competitive results to the state-of-art models in their respective domains. We further validate the language model using Xhosa and Shona, which are syllabic-based languages. The novelty of the study is in its capability to construct quality word embeddings from syllables using a hybrid model that does not use max-over-pool common in CNN and then the exploitation of these embeddings in POS tagging. Therefore, the study plays a crucial role in the processing of agglutinative and syllabic-based languages by contributing quality word embeddings from syllable embeddings, a robust Conv–LSTM model that learns syllables for not only language modeling and POS tagging, but also for other downstream NLP tasks.


2020 ◽  
Vol 34 (05) ◽  
pp. 7456-7463 ◽  
Author(s):  
Zied Bouraoui ◽  
Jose Camacho-Collados ◽  
Steven Schockaert

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.


2021 ◽  
Vol 11 (22) ◽  
pp. 11018
Author(s):  
Xianwen Liao ◽  
Yongzhong Huang ◽  
Changfu Wei ◽  
Chenhao Zhang ◽  
Yongqing Deng ◽  
...  

Obtaining high-quality embeddings of out-of-vocabularies (OOVs) and low-frequency words is a challenge in natural language processing (NLP). To efficiently estimate the embeddings of OOVs and low-frequency words, we propose a new method that uses the dictionary to estimate the embeddings of OOVs and low-frequency words. More specifically, the explanatory note of an entry in dictionaries accurately describes the semantics of the corresponding word. Naturally, we adopt the sentence representation model to extract the semantics of the explanatory note and regard the semantics as the embedding of the corresponding word. We design a new sentence representation model to encode sentences to extract the semantics from the explanatory notes of entries more efficiently. Based on the assumption that the higher quality of word embeddings will lead to better performance, we design an extrinsic experiment to evaluate the quality of low-frequency words’ embeddings. The experimental results show that the embeddings of low-frequency words estimated by our proposed method have higher quality. In addition, both intrinsic and extrinsic experiments show that our proposed sentence representation model can represent the semantics of sentences well.


AI ◽  
2021 ◽  
Vol 2 (1) ◽  
pp. 1-16
Author(s):  
Juan Cruz-Benito ◽  
Sanjay Vishwakarma ◽  
Francisco Martin-Fernandez ◽  
Ismael Faro

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.


2020 ◽  
Vol 14 (4) ◽  
pp. 471-484
Author(s):  
Suraj Shetiya ◽  
Saravanan Thirumuruganathan ◽  
Nick Koudas ◽  
Gautam Das

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing such as embeddings to encode the strings and use it to train a model. While this is an improvement over traditional approaches, there is a large scope for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.


2019 ◽  
Author(s):  
José Padarian ◽  
Ignacio Fuentes

Abstract. A large amount of descriptive information is available in most disciplines of geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descriptive information and encode it as dense vectors. These word embeddings lay on a multi-dimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations, namely: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. Since this is the first attempt to evaluate word embedding for tasks in the geosciences domain, we created a test suite specific for geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9 %. The resulting embedding and test suite will be made available for other researchers to use an expand.


Author(s):  
Bali Ranaivo-Malancon

Identifying the language of an unknown text is not a new problem but what is new is the task of identifying close languages. Malay and Indonesian as many other languages are very similar, and therefore it is a real difficulty to search, retrieve, classify, and above all translate texts written in one of the two languages. We have built a language identifier todetermine whether the text is written in Malay or Indonesian which could be used in any similar situation. It uses the frequency and rank of trigrams of characters, the lists of exclusive words, and the format of numbers. The trigrams are derived from the most frequent words in each language. The current program contains as language models: Malay/Indonesian (661 trigrams), Dutch (826 trigrams), English (652 trigrams), French (579 trigrams), and German (482 trigrams). The trigrams of an unknown text are searched in each language model. The language of the input text is the language having the highest ratio in “number of shared trigrams / total number of trigrams” and “number of winner trigrams / number of shared trigrams”. If the language found at trigram search level is ’Malay or Indonesian’, the text is then scanned by searching the format of numbers and ofsome exclusive words.


2018 ◽  
Author(s):  
Christoph Aurnhammer ◽  
Stefan L. Frank

The Simple Recurrent Network (SRN) has a long tradition in cognitive models of language processing. More recently, gated recurrent networks have been proposed that often outperform the SRN on natural language processing tasks. Here, we investigate whether two types of gated networks perform better as cognitive models of sentence reading than SRNs, beyond their advantage as language models.This will reveal whether the filtering mechanism implemented in gated networks corresponds to an aspect of human sentence processing.We train a series of language models differing only in the cell types of their recurrent layers. We then compute word surprisal values for stimuli used in self-paced reading, eye-tracking, and electroencephalography experiments, and quantify the surprisal values' fit to experimental measures that indicate human sentence reading effort.While the gated networks provide better language models, they do not outperform their SRN counterpart as cognitive models when language model quality is equal across network types. Our results suggest that the different architectures are equally valid as models of human sentence processing.


2021 ◽  
Vol 24 (67) ◽  
pp. 1-17
Author(s):  
Flávio Arthur O. Santos ◽  
Thiago Dias Bispo ◽  
Hendrik Teixeira Macedo ◽  
Cleber Zanchettin

Natural language processing systems have attracted much interest of the industry. This branch of study is composed of some applications such as machine translation, sentiment analysis, named entity recognition, question and answer, and others. Word embeddings (i.e., continuous word representations) are an essential module for those applications generally used as word representation to machine learning models. Some popular methods to train word embeddings are GloVe and Word2Vec. They achieve good word representations, despite limitations: both ignore morphological information of the words and consider only one representation vector for each word. This approach implies the word embeddings does not consider different word contexts properly and are unaware of its inner structure. To mitigate this problem, the other word embeddings method FastText represents each word as a bag of characters n-grams. Hence, a continuous vector describes each n-gram, and the final word representation is the sum of its characters n-grams vectors. Nevertheless, the use of all n-grams character of a word is a poor approach since some n-grams have no semantic relation with their words and increase the amount of potentially useless information. This approach also increase the training phase time. In this work, we propose a new method for training word embeddings, and its goal is to replace the FastText bag of character n-grams for a bag of word morphemes through the morphological analysis of the word. Thus, words with similar context and morphemes are represented by vectors close to each other. To evaluate our new approach, we performed intrinsic evaluations considering 15 different tasks, and the results show a competitive performance compared to FastText. Moreover, the proposed model is $40\%$ faster than FastText in the training phase. We also outperform the baseline approaches in extrinsic evaluations through Hate speech detection and NER tasks using different scenarios.


Author(s):  
Monica Bobrowski ◽  
Sabrina Soler

Data plays a critical role in organizations up to the point of being considered a competitive advantage. However, the quality of the organizations’ data is often inadequate, affecting strategic and tactical decision making, and even weakening the organization’s image. Nevertheless it is still challenging to encourage management to invest in data quality improvement projects. Performing a traditional feasibility analysis based on Return on Investment, Net Present Value, etc., may not capture the advantages of data quality projects: their benefits are often difficult to quantify and uncertain; also, they are mostly valuable because of the new opportunities they bring about. Dealing with this problem through a real options approach, in order to model its intrinsic uncertainty, seems to be an interesting starting point. This paper presents a methodological framework to assess the benefits of a Data Quality project using a real options approach. Its adequacy is validated with a case study.


Sign in / Sign up

Export Citation Format

Share Document