Selection of correction candidates for the normalization of Spanish user-generated content

Abstract. We present research aimed at building tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires revisiting the initial steps of Natural Language Processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics, often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources (Twitter, consumer reviews, and blogs) and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim in this paper is to harness the power of existing spell and grammar correction engines and endow them with automatic normalization capabilities, paving the way for the application of standard Natural Language Processing tools to typical UGC text. In particular, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker that selects the most plausible correction from an unranked list of candidates provided by the spell-checker. To build this selector module we train four language models, each containing a different type of linguistic information in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with a parametrized combination of the models, both by optimizing directly on the selection task and by linearly interpolating the models. The resulting parametrized combinations obtain results close to the best-performing model but do not improve on them, as measured on the test set. The precision of the selector module in ranking the expected correction proposal first on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).
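
The candidate-selection step described in this abstract can be illustrated with a minimal sketch. It assumes two add-one-smoothed bigram models (one over truecase forms, one over lowercase forms), a three-sentence toy corpus, and equal interpolation weights. The class `BigramLM`, the function `select_candidate`, and all data are hypothetical illustrations, not the paper's implementation, which trains four models on real corpora.

```python
from collections import Counter


class BigramLM:
    """Add-one-smoothed bigram model over (optionally transformed) word forms."""

    def __init__(self, sentences, transform=lambda w: w):
        self.transform = transform
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + [transform(w) for w in sentence.split()]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra slot in the vocabulary for unseen words.
        self.vocab_size = len(self.unigrams) + 1

    def prob(self, prev, word):
        """P(word | prev) with add-one smoothing."""
        prev, word = self.transform(prev), self.transform(word)
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)


def select_candidate(prev, candidates, models, weights):
    """Return the candidate with the highest linearly interpolated score."""
    def score(candidate):
        return sum(w * m.prob(prev, candidate) for m, w in zip(models, weights))
    return max(candidates, key=score)


# Toy training data (assumption: a real system would use a large corpus).
corpus = ["vivo en Madrid", "Madrid es grande", "mi madre vive en Madrid"]
truecase_lm = BigramLM(corpus)                        # keeps capitalization
lowercase_lm = BigramLM(corpus, transform=str.lower)  # generalizes over case
models = [truecase_lm, lowercase_lm]
weights = [0.5, 0.5]  # interpolation weights, assumed equal here

# Unranked candidates as a spell-checker might propose them for "madri".
best = select_candidate("en", ["madrid", "Madrid", "madre"], models, weights)
print(best)
```

In this toy setting the lowercase model cannot tell "madrid" from "Madrid", while the truecase model can, so the interpolated score prefers the capitalized form. The weights could instead be tuned on held-out data, which is the kind of parametrized combination the abstract mentions.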

Natural language processing tools for computer assisted language learning

Linguistik Online ◽

10.13092/lo.17.790 ◽

2003 ◽

Vol 17 (5) ◽

Author(s):

Anne Vandeventer Faltin

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Learning ◽

Language Processing ◽

Computer Assisted ◽

Computer Assisted Language Learning ◽

Sentence Structure ◽

Error Diagnosis ◽

Diagnosis System ◽

Spell Checker

This paper illustrates the usefulness of natural language processing (NLP) tools for computer assisted language learning (CALL) by presenting three NLP tools integrated into a CALL software package for French. These tools are (i) a sentence structure viewer; (ii) an error diagnosis system; and (iii) a conjugation tool. The sentence structure viewer helps language learners grasp the structure of a sentence by providing lexical and grammatical information. This information is derived from a deep syntactic analysis. Two different outputs are presented. The error diagnosis system is composed of a spell checker, a grammar checker, and a coherence checker. The spell checker makes use of alpha-codes, phonological reinterpretation, and some ad hoc rules to provide correction proposals. The grammar checker employs constraint relaxation and phonological reinterpretation as diagnosis techniques. The coherence checker compares the underlying "semantic" structures of a stored answer and of the learner's input to detect semantic discrepancies. The conjugation tool is a resource whose capabilities are enhanced in electronic form, enabling searches that start from inflected and ambiguous verb forms.

Text: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning

10.31234/osf.io/293kt ◽

2021 ◽

Author(s):

Oscar Nils Erik Kjell ◽

H. Andrew Schwartz ◽

Salvatore Giorgi

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Rating Scale ◽

State Of The Art ◽

R Package ◽

Language Models ◽

Categorical Variables ◽

Human Language

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language, such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. Text is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as a framework that advanced users can build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed: to transform text to traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain: to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest: to compute semantic similarity scores between texts and to test the significance of the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot: to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating scale scores).
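
As a rough illustration of what a semantic similarity score between two texts can look like once each text has been turned into an embedding, the sketch below computes cosine similarity in plain Python. It is not the R package's code: the function name `cosine_similarity` and the three-dimensional toy vectors are assumptions made for the example, and the choice of cosine as the measure is itself an assumption, since the abstract does not say which measure the package uses. Real transformer embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy "embeddings" (hypothetical values, for illustration only).
text_a = [0.2, 0.7, 0.1]
text_b = [0.25, 0.65, 0.05]
text_c = [0.9, 0.05, 0.4]

print(cosine_similarity(text_a, text_b))  # close in direction: higher score
print(cosine_similarity(text_a, text_c))  # different direction: lower score
```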

Database Tuning using Natural Language Processing

ACM SIGMOD Record ◽

10.1145/3503780.3503788 ◽

2021 ◽

Vol 50 (3) ◽

pp. 27-28

Author(s):

Immanuel Trummer

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Training Data ◽

Language Models ◽

Learning Approaches ◽

Training Samples ◽

Starting Point ◽

Training Cost ◽

Transformer Model

Introduction. We have seen significant advances in the state of the art in natural language processing (NLP) over the past few years [20]. These advances have been driven by new neural network architectures, in particular the Transformer model [19], as well as the successful application of transfer learning approaches to NLP [13]. Typically, training for specific NLP tasks starts from large language models that have been pre-trained on generic tasks (e.g., predicting obfuscated words in text [5]) for which large amounts of training data are available. Using such models as a starting point reduces task-specific training cost as well as the number of required training samples by orders of magnitude [7]. These advances motivate new use cases for NLP methods in the context of databases.

Recent Developments in Natural Language Processing

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.005 ◽

2021 ◽

Author(s):

Constantin Orasan ◽

Ruslan Mitkov

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Positive Impact ◽

User Generated Content ◽

Conversational Agents ◽

Automatic Assessment ◽

Reading Material ◽

Multimodal Information ◽

Recent Developments

Natural Language Processing (NLP) is a dynamic and rapidly developing field in which new trends, techniques, and applications are constantly emerging. This chapter focuses mainly on recent developments in NLP which could not be covered in other chapters of the Handbook. Topics such as crowdsourcing and processing of large datasets, which are no longer that recent but are widely used and not covered at length in any other chapter, are also presented. The chapter starts by describing how the availability of tools and resources has had a positive impact on the field. The proliferation of user-generated content has led to the emergence of research topics such as sarcasm and irony detection, automatic assessment of user-generated content, and stance detection. All of these topics are discussed in the chapter. The field of NLP is approaching maturity, a fact corroborated by the latest developments in the processing of texts for financial purposes and for helping users with disabilities, two topics that are also discussed here. The chapter presents examples of how researchers have successfully combined research in computer vision and natural language processing to enable the processing of multimodal information, as well as how the latest advances in deep learning have revitalized research on chatbots and conversational agents. The chapter concludes with a comprehensive list of further reading material and additional resources.

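Automatisierte Abbildung semantisch heterogener I4.0-Verwaltungsschalen durch Methoden des Natural Language Processing [Automated mapping of semantically heterogeneous I4.0 asset administration shells using natural language processing methods]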

at - Automatisierungstechnik ◽

10.1515/auto-2021-0050 ◽

2021 ◽

Vol 69 (11) ◽

pp. 940-951

Author(s):

Maximilian Both ◽

Jochen Müller ◽

Christian Diedrich

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Industrie 4.0 ◽

Language Models ◽

ISO Standards

Abstract. Systems in the Industrie 4.0 domain are meant to interoperate with one another. For this to be realized automatically, they must be semantically interoperable. To this end, the current Industrie 4.0 research approach focuses on a semantically homogeneous language space. This paper presents a method that extends this approach to heterogeneous semantics. Mapping unknown vocabularies onto a target ontology enables interactions between heterogeneous asset administration shells (Verwaltungsschalen). The mapping is based on methods from the field of Natural Language Processing. For this purpose, language models and sentence embeddings pre-trained on ISO standards are combined. This yields promising accuracy on the evaluation dataset created for this work, which contains differing semantics for the identification and design submodels of the Pumpe 4.0 project.

Unsupervised multi-sense language models for natural language processing tasks

Neural Networks ◽

10.1016/j.neunet.2021.05.023 ◽

2021 ◽

Author(s):

Jihyeon Roh ◽

Sungjin Park ◽

Bo-Kyeong Kim ◽

Sang-Hoon Oh ◽

Soo-Young Lee

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Models

Online Assignment Plagiarism Detector

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-1057 ◽

2021 ◽

pp. 528-535

Author(s):

Nikhil Paymode ◽

Rahul Yadav ◽

Sudarshan Vichare ◽

Suvarna Bhoir

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Processing Technique ◽

The Internet ◽

Plagiarism Detection ◽

Efficient Manner ◽

Detection Approach ◽

Different Sources ◽

Student Program

Plagiarism is a serious problem for companies, schools, colleges, and anyone who publishes documents on the web. In schools and colleges, most students write their assignments and experiments by copying other documents. With this system, teachers and examiners can detect whether a document or sheet was written by the respective student or copied from someone else. To check for plagiarism, the system takes two or more documents as input and produces its output using string matching algorithms, natural language processing (NLP) techniques, and the Natural Language Toolkit (NLTK). The output is a score in the interval from 0 to 1, where 1 means the documents are exactly similar and 0 means nothing is similar (unique). A score strictly between 0 and 1 indicates that only part of the document is similar. The main objective of the system is to identify plagiarized content more accurately, including documents with similar meanings and concepts, and to do so efficiently. Because it is very easy to copy material from different sources, including the internet, papers, online books, and newspapers, plagiarism needs to be detected in order to improve students' learning. To solve this problem, a student program plagiarism detection approach based on Natural Language Processing is proposed.
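
A score between 0 and 1 of the kind described in this abstract can be sketched with a simple string-matching measure. The example below uses Jaccard similarity over word trigrams, implemented with the Python standard library only. This is an assumed stand-in, not the authors' method: the abstract names string matching and NLTK but does not specify the algorithm, and the function names `shingles` and `similarity` are hypothetical.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams in a text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Short texts: treat the whole text as a single shingle (if non-empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(doc_a, doc_b, n=3):
    """Jaccard similarity of word n-gram sets: 1.0 = identical, 0.0 = no overlap."""
    set_a, set_b = shingles(doc_a, n), shingles(doc_b, n)
    if not set_a and not set_b:
        return 1.0  # two empty documents are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


original = "the cat sat on the mat"
print(similarity(original, "The cat sat on the mat."))    # exact copy
print(similarity(original, "the cat sat on a chair"))     # partial overlap
print(similarity(original, "dogs bark loudly at night"))  # unrelated
```

A measure like this only catches surface copying. Detecting documents with similar meaning but different wording, as the abstract aims to, would need additional semantic processing.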

Inducing Relational Knowledge from BERT

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6242 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7456-7463 ◽

Cited By ~ 3

Author(s):

Zied Bouraoui ◽

Jose Camacho-Collados ◽

Steven Schockaert

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Model ◽

Language Models ◽

Word Embeddings ◽

Relational Knowledge ◽

Wide Range ◽

Fine Tune ◽

Standard Word

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.

A Comprehensive Exploration of Pre-training Language Models

10.36227/techrxiv.14820348 ◽

2021 ◽

Author(s):

Tong Guo

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Contextual Information ◽

Experimental Results ◽

Language Models

Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to a new state of the art. In this paper we explore the efficiency of various pre-trained language models. We pre-train a list of transformer-based models with the same amount of text and the same number of training steps. The experimental results show that the largest improvement over the original BERT comes from adding an RNN layer to capture more contextual information for the transformer-encoder layers.

Large Language Models for Latvian Named Entity Recognition

Frontiers in Artificial Intelligence and Applications - Human Language Technologies – The Baltic Perspective ◽

10.3233/faia200603 ◽

2020 ◽

Author(s):

Rinalds Vīksna ◽

Inguna Skadiņa

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Named Entity ◽

Language Data

Transformer-based language models pre-trained on large corpora have demonstrated good results on multiple natural language processing tasks, including named entity recognition (NER), for widely used languages. In this paper, we investigate the role of BERT models in the NER task for Latvian. We introduce a BERT model pre-trained on Latvian language data. We demonstrate that the Latvian BERT model, pre-trained on large Latvian corpora, achieves better results than multilingual BERT (81.91 F1-measure on average vs 78.37 for M-BERT on a dataset with nine named entity types, and 79.72 vs 78.83 on another dataset with seven types) and outperforms previously developed Latvian NER systems.
