Learning Representations for Weakly Supervised Natural Language Processing Tasks

2014 ◽  
Vol 40 (1) ◽  
pp. 85-120 ◽  
Author(s):  
Fei Huang ◽  
Arun Ahuja ◽  
Doug Downey ◽  
Yi Yang ◽  
Yuhong Guo ◽  
...  

Finding the right representations for words is critical for building accurate NLP systems when domain-specific labeled data for the task is scarce. This article investigates novel techniques for extracting features from n-gram models, Hidden Markov Models, and other statistical language models, including a novel Partial Lattice Markov Random Field model. Experiments on part-of-speech tagging and information extraction, among other tasks, indicate that features taken from statistical language models, in combination with more traditional features, outperform traditional representations alone, and that graphical model representations outperform n-gram models, especially on sparse and polysemous words.

Author(s):  
M. D. Riazur Rahman ◽  
M. D. Tarek Habib ◽  
M. D. Sadekur Rahman ◽  
Gazi Zahirul Islam ◽  
M. D. Abbas Ali Khan

N-gram based language models are very popular and extensively used statistical methods for solving various natural language processing problems including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with data sparsity problem. Kneser-Ney is one of the most prominently used and successful smoothing technique for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking grammatical correctness of Bangla sentences which showed promising results outperforming previous methods. In this work, we proposed an improved method using Kneser-Ney smoothing based n-gram language model for grammar checking and performed a comparative performance analysis between Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provided an improved technique for calculating the optimum threshold which further enhanced the the results. Our experimental results show that, Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking grammatical correctness of Bangla sentences.


2016 ◽  
Vol 105 (1) ◽  
pp. 63-76
Author(s):  
Theresa Guinard

Abstract Morphological analysis (finding the component morphemes of a word and tagging morphemes with part-of-speech information) is a useful preprocessing step in many natural language processing applications, especially for synthetic languages. Compound words from the constructed language Esperanto are formed by straightforward agglutination, but for many words, there is more than one possible sequence of component morphemes. However, one segmentation is usually more semantically probable than the others. This paper presents a modified n-gram Markov model that finds the most probable segmentation of any Esperanto word, where the model’s states represent morpheme part-of-speech and semantic classes. The overall segmentation accuracy was over 98% for a set of presegmented dictionary words.


2008 ◽  
Vol 04 (01) ◽  
pp. 87-106
Author(s):  
ALKET MEMUSHAJ ◽  
TAREK M. SOBH

Probabilistic language models have gained popularity in Natural Language Processing due to their ability to successfully capture language structures and constraints with computational efficiency. Probabilistic language models are flexible and easily adapted to language changes over time as well as to some new languages. Probabilistic language models can be trained and their accuracy strongly related to the availability of large text corpora. In this paper, we investigate the usability of grapheme probabilistic models, specifically grapheme n-grams models in spellchecking as well as augmentative typing systems. Grapheme n-gram models require substantially smaller training corpora and that is one of the main drivers for this thesis in which we build grapheme n-gram language models for the Albanian language. There are presently no available Albanian language corpora to be used for probabilistic language modeling. Our technique attempts to augment spellchecking and typing systems by utilizing grapheme n-gram language models in improving suggestion accuracy in spellchecking and augmentative typing systems. Our technique can be implemented in a standalone tool or incorporated in another tool to offer additional selection/scoring criteria.


2018 ◽  
Vol 7 (3.27) ◽  
pp. 125
Author(s):  
Ahmed H. Aliwy ◽  
Duaa A. Al_Raza

Part Of Speech (POS) tagging of Arabic words is a difficult and non-travail task it was studied in details for the last twenty years and its performance affects many applications and tasks in area of natural language processing (NLP). The sentence in Arabic language is very long compared with English sentence. This affect tagging process for any approach deals with complete sentence at once as in Hidden Markov Model HMM tagger. In this paper, new approach is suggested for using HMM and n-grams taggers for tagging Arabic words in a long sentence. The suggested approach is very simple and easy to implement. It is implemented on data set of 1000 documents of 526321 tokens annotated manually (containing punctuations). The results shows that the suggested approach has higher accuracy than HMM and n-gram taggers. The F-measures were 0.888, 0.925 and 0.957 for n-grams, HMM and the suggested approach respectively.


2020 ◽  
Author(s):  
Jose Juan Almagro Armenteros ◽  
Alexander Rosenberg Johansen ◽  
Ole Winther ◽  
Henrik Nielsen

AbstractMotivationLanguage modelling (LM) on biological sequences is an emergent topic in the field of bioinformatics. Current research has shown that language modelling of proteins can create context-dependent representations that can be applied to improve performance on different protein prediction tasks. However, little effort has been directed towards analyzing the properties of the datasets used to train language models. Additionally, only the performance of cherry-picked downstream tasks are used to assess the capacity of LMs.ResultsWe analyze the entire UniProt database and investigate the different properties that can bias or hinder the performance of LMs such as homology, domain of origin, quality of the data, and completeness of the sequence. We evaluate n-gram and Recurrent Neural Network (RNN) LMs to assess the impact of these properties on performance. To our knowledge, this is the first protein dataset with an emphasis on language modelling. Our inclusion of properties specific to proteins gives a detailed analysis of how well natural language processing methods work on biological sequences. We find that organism domain and quality of data have an impact on the performance, while the completeness of the proteins has little influence. The RNN based LM can learn to model Bacteria, Eukarya, and Archaea; but struggles with Viruses. By using the LM we can also generate novel proteins that are shown to be similar to real proteins.Availability and implementationhttps://github.com/alrojo/UniLanguage


Author(s):  
Dim Lam Cing ◽  
Khin Mar Soe

In Natural Language Processing (NLP), Word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. The POS information is also necessary in NLP’s preprocessing work applications such as machine translation (MT), information retrieval (IR), etc. Currently, there are many research efforts in word segmentation and POS tagging developed separately with different methods to get high performance and accuracy. For Myanmar Language, there are also separate word segmentors and POS taggers based on statistical approaches such as Neural Network (NN) and Hidden Markov Models (HMMs). But, as the Myanmar language's complex morphological structure, the OOV problem still exists. To keep away from error and improve segmentation by utilizing POS data, segmentation and labeling should be possible at the same time.The main goal of developing POS tagger for any Language is to improve accuracy of tagging and remove ambiguity in sentences due to language structure. This paper focuses on developing word segmentation and Part-of- Speech (POS) Tagger for Myanmar Language. This paper presented the comparison of separate word segmentation and POS tagging with joint word segmentation and POS tagging.


Author(s):  
Atro Voutilainen

This article outlines the recently used methods for designing part-of-speech taggers; computer programs for assigning contextually appropriate grammatical descriptors to words in texts. It begins with the description of general architecture and task setting. It gives an overview of the history of tagging and describes the central approaches to tagging. These approaches are: taggers based on handwritten local rules, taggers based on n-grams automatically derived from text corpora, taggers based on hidden Markov models, taggers using automatically generated symbolic language models derived using methods from machine tagging, taggers based on handwritten global rules, and hybrid taggers, which combine the advantages of handwritten and automatically generated taggers. This article focuses on handwritten tagging rules. Well-tagged training corpora are a valuable resource for testing and improving language model. The text corpus reminds the grammarian about any oversight while designing a rule.


Author(s):  
Szymon Roziewski ◽  
Marek Kozłowski

AbstractThe exponential growth of the internet community has resulted in the production of a vast amount of unstructured data, including web pages, blogs and social media. Such a volume consisting of hundreds of billions of words is unlikely to be analyzed by humans. In this work we introduce the tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily build web-scale corpora using the Common Crawl Archive—an open repository of web crawl information, which contains petabytes of data. We present three use cases in the course of this work: filtering of Polish websites, the construction of n-gram corpora and the training of a continuous skipgram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility to adjust specified language and n-gram ranks. This paper focuses particularly on high computing efficiency by applying highly concurrent multitasking. Our tool utilizes effective libraries and design. LanguageCrawl has been made publicly available to enrich the current set of NLP resources. We strongly believe that our work will facilitate further NLP research, especially in under-resourced languages, in which the lack of appropriately-sized corpora is a serious hindrance to applying data-intensive methods, such as deep neural networks.


Sign in / Sign up

Export Citation Format

Share Document