Data Collection for Natural Language Processing Systems

Natural language processing for similar languages, varieties, and dialects: A survey

Natural Language Engineering ◽

10.1017/s1351324920000492 ◽

2020 ◽

Vol 26 (6) ◽

pp. 595-612

Author(s):

Marcos Zampieri ◽

Preslav Nakov ◽

Yves Scherrer

Keyword(s):

Natural Language Processing ◽

Data Collection ◽

Natural Language ◽

Machine Translation ◽

Computational Methods ◽

Language Processing ◽

Language Varieties ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.

Jumpstarting the Justice Disciplines: A Computational-Qualitative Approach to Collecting and Analyzing Text and Image Data in Criminology and Criminal Justice Studies

10.31235/osf.io/4nhd6 ◽

2021 ◽

Author(s):

Alex Luscombe ◽

Jamie Duncan ◽

Kevin Walby

Keyword(s):

Qualitative Research ◽

Big Data ◽

Natural Language Processing ◽

Criminal Justice ◽

Data Collection ◽

Natural Language ◽

Computational Methods ◽

Language Processing ◽

Data Set ◽

Web Scraping

Computational methods are increasingly popular in criminal justice research. As more criminal justice data becomes available in big data and other digital formats, new means of embracing the computational turn are needed. In this article, we propose a framework for data collection and case sampling using computational methods, allowing researchers to conduct thick qualitative research – analyses concerned with the particularities of a social context or phenomenon – starting from big data, which is typically associated with thinner quantitative methods and the pursuit of generalizable findings. The approach begins by using open-source web scraping algorithms to collect content from a target website, online database, or comparable online source. Next, researchers use computational techniques from the field of natural language processing to explore themes and patterns in the larger data set. Based on these initial explorations, researchers algorithmically generate a subset of data for in-depth qualitative analysis. In this computationally driven process of data collection and case sampling, the larger corpus and subset are never entirely divorced, a feature we argue has implications for traditional qualitative research techniques and tenets. To illustrate this approach, we collect, subset, and analyze three years of news releases from the Royal Canadian Mounted Police website (N = 13,637) using a mix of web scraping, natural language processing, and visual discourse analysis. To enhance the pedagogical value of our intervention and facilitate replication and secondary analysis, we make all data and code available online in the form of a detailed, step-by-step tutorial.
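
The sampling step described in this abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' published tutorial code: the sample documents, the theme terms, and the function names `tokenize` and `select_subset` are all hypothetical, and simple keyword counting stands in for whatever natural language processing techniques the authors actually used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def select_subset(docs, theme_terms, min_hits=1):
    """Return (index, hits) pairs for documents that mention the theme terms.

    Hypothetical helper: keyword counting is a stand-in for the exploratory
    NLP step. Indices are kept so the subset stays linked to the full corpus.
    """
    terms = set(theme_terms)
    subset = []
    for i, doc in enumerate(docs):
        counts = Counter(tokenize(doc))
        hits = sum(counts[t] for t in terms)
        if hits >= min_hits:
            subset.append((i, hits))
    # Documents with the most theme-term hits come first.
    return sorted(subset, key=lambda pair: pair[1], reverse=True)


# Made-up stand-ins for scraped news releases (assumption, not real data).
docs = [
    "Police seize drugs and cash in traffic stop",
    "Community barbecue planned for Saturday",
    "Drugs charges laid after search warrant; drugs and weapons seized",
]
subset = select_subset(docs, ["drugs", "weapons"])
```

Because each entry in `subset` carries the index of its document in `docs`, a researcher reading the subset closely can always return to the larger corpus, which mirrors the abstract's point that the two are never entirely divorced.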

06.213: Attacks with Knives and Sharp Instruments: Quantitative Coding and the Witness to Atrocity

Leonardo ◽

10.1162/leon_a_00345 ◽

2012 ◽

Vol 45 (1) ◽

pp. 86-87

Author(s):

Ben Miller

Keyword(s):

Natural Language Processing ◽

Data Collection ◽

Natural Language ◽

Language Processing ◽

Traumatic Events ◽

Text Analytics ◽

Controlled Vocabularies ◽

New Methods ◽

Data Schema ◽

Text Corpora

Text corpora of testimony to survival and other traumatic events have expanded because of more efficacious and available data collection tools. New methods mobilizing controlled vocabularies, relational data schema, and natural language processing both enable these fragmentary collections of witnessing and offer ways to make them readable. “06.213” describes the background for these methods as it relates to testimonial corpora, and the framework of a new text analytics project focused on organizing the unstructured fragments of a collection around reader-specified conceptual foci.

Outcome Measure Harmonization and Data Infrastructure for Patient-Centered Outcomes Research in Depression: Report on Registry Configuration

10.23970/ahrqepcregistryoutcome ◽

2020 ◽

Author(s):

Michelle B. Leavy ◽

Danielle Cooke ◽

Sarah Hajjar ◽

Erik Bikelman ◽

Bailey Egan ◽

...

Keyword(s):

Natural Language Processing ◽

Data Collection ◽

Adverse Events ◽

Natural Language ◽

Outcome Measures ◽

Language Processing ◽

Outcomes Research ◽

Mortality Data ◽

Clinical Workflow ◽

Patient Centered

Background: Major depressive disorder is a common mental disorder. Many pressing questions regarding depression treatment and outcomes exist, and new, efficient research approaches are necessary to address them. The primary objective of this project is to demonstrate the feasibility and value of capturing the harmonized depression outcome measures in the clinical workflow and submitting these data to different registries. Secondary objectives include demonstrating the feasibility of using these data for patient-centered outcomes research and developing a toolkit to support registries interested in sharing data with external researchers. Methods: The harmonized outcome measures for depression were developed through a multi-stakeholder, consensus-based process supported by AHRQ. For this implementation effort, the PRIME Registry, sponsored by the American Board of Family Medicine, and PsychPRO, sponsored by the American Psychiatric Association, each recruited 10 pilot sites from existing registry sites, added the harmonized measures to the registry platform, and submitted the project for institutional review board review. Results: The process of preparing each registry to calculate the harmonized measures produced three major findings. First, some clarifications were necessary to make the harmonized definitions operational. Second, some data necessary for the measures are not routinely captured in structured form (e.g., PHQ-9 item 9, adverse events, suicide ideation and behavior, and mortality data). Finally, capture of the PHQ-9 requires operational and technical modifications. The next phase of this project will focus on collection of the baseline and follow-up PHQ-9s, as well as other supporting clinical documentation. In parallel to the data collection process, the project team will examine the feasibility of using natural language processing to extract information on PHQ-9 scores, adverse events, and suicidal behaviors from unstructured data.
Conclusion: This pilot project represents the first practical implementation of the harmonized outcome measures for depression. Initial results indicate that it is feasible to calculate the measures within the two patient registries, although some challenges were encountered related to the harmonized definition specifications, the availability of the necessary data, and the clinical workflow for collecting the PHQ-9. The ongoing data collection period, combined with an evaluation of the utility of natural language processing for these measures, will produce more information about the practical challenges, value, and burden of using the harmonized measures in the primary care and mental health setting. These findings will be useful to inform future implementations of the harmonized depression outcome measures.
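
As a rough illustration of what extracting PHQ-9 scores from unstructured text could involve in its simplest, rule-based form, consider the Python sketch below. It is an assumption made for illustration only, not the project team's method: the pattern `PHQ9_PATTERN`, the function `extract_phq9_scores`, and the sample note are all hypothetical, and real clinical notes would need far more robust handling.

```python
import re

# Hypothetical pattern: "PHQ-9" or "PHQ9", optional words such as "total" or
# "score", an optional separator ("of", "is", "was", ":" or "="), then a
# one- or two-digit number.
PHQ9_PATTERN = re.compile(
    r"PHQ-?9(?:\s+(?:total|score))*\s*(?:of|is|was|:|=)?\s*(\d{1,2})\b",
    re.IGNORECASE,
)


def extract_phq9_scores(note):
    """Return the plausible PHQ-9 total scores mentioned in a free-text note."""
    scores = []
    for match in PHQ9_PATTERN.finditer(note):
        value = int(match.group(1))
        # Nine items scored 0 to 3 each, so a valid total lies between 0 and 27.
        if 0 <= value <= 27:
            scores.append(value)
    return scores


# Made-up note text (assumption, not real patient data).
note = "Pt seen for follow-up. PHQ-9 score: 14 today, down from PHQ9 of 19 at intake."
scores = extract_phq9_scores(note)
```

A pattern like this only finds scores written in a few predictable ways; it ignores mentions such as "PHQ-9 item 9" and says nothing about adverse events or suicidal behaviors, which is part of why the feasibility of such extraction is an open question.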

Natural Language Processing and Enhanced Clinical Decision Making Radiology and VINCI

PsycEXTRA Dataset ◽

10.1037/e615572012-015 ◽

2012 ◽

Author(s):

Eliot Siegel

Keyword(s):

Decision Making ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Decision Making ◽

Clinical Decision

Natural Language Processing in the Clinical Setting

PsycEXTRA Dataset ◽

10.1037/e615572012-013 ◽

2012 ◽

Author(s):

Thomas H. Payne

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Setting

A Review and evaluation of Machine Translation methods for Lumasaaba

Journal of Digital Science ◽

10.33847/2686-8296.2.1_1 ◽

2020 ◽

pp. 3-17

Author(s):

Peter Nabende

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Machine Translation ◽

Language Processing ◽

Research Area ◽

Data Driven ◽

East African ◽

Data Set ◽

African Languages ◽

Translation Methods

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies on Natural Language Processing applications for many indigenous East African languages. As a contribution to closing this knowledge gap, this paper evaluates the application of well-established machine translation methods to one heavily under-resourced indigenous East African language, Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba, including both rule-based and data-driven methods. We then apply a state-of-the-art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than the recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually associated with the source language input.

An AdaBoost Using a Weak-Learner Generating Several Weak-Hypotheses for Large Training Data of Natural Language Processing

IEEJ Transactions on Electronics Information and Systems ◽

10.1541/ieejeiss.130.83 ◽

2010 ◽

Vol 130 (1) ◽

pp. 83-91 ◽

Cited By ~ 1

Author(s):

Tomoya Iwakura ◽

Seishi Okamoto ◽

Kazuo Asakawa

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Training Data ◽

Weak Learner

1243-P: Novel Use of Natural Language Processing to Identify Reasons for Insulin Discontinuation in Patients with T2DM: A Real-World Evidence Study

Diabetes ◽

10.2337/db19-1243-p ◽

2019 ◽

Vol 68 (Supplement 1) ◽

pp. 1243-P

Author(s):

JIANMIN WU ◽

FRITHA J. MORRISON ◽

ZHENXIANG ZHAO ◽

XUANYAO HE ◽

MARIA SHUBINA ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Real World ◽

Real World Evidence

WHO DO WE THINK WE ARE? COMPARING INTERSECTIONAL IDENTITY TRENDS IN ASEE AND CEEA-ACEG USING NATURAL LANGUAGE PROCESSING AND REVIEW OF PROCEEDINGS

Proceedings of the Canadian Engineering Education Association (CEEA) ◽

10.24908/pceea.vi0.13830 ◽

2019 ◽

Author(s):

Pamela Rogalski ◽

Eric Mikulin ◽

Deborah Tihanyi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Activity Theory ◽

Language Processing ◽

Division Of Labour ◽

Cultural Historical Activity Theory ◽

Original Question ◽

Micro Level ◽

Historical Activity ◽

Cultural Historical Activity

In 2018, we overheard many CEEA-ACEG members stating that they have "found their people"; this led us to wonder what makes this evolving community unique. Using cultural historical activity theory to view the proceedings of CEEA-ACEG 2004-2018 in comparison with the geographically and intellectually adjacent ASEE, we used both machine-driven (Natural Language Processing, NLP) and human-driven (literature review of the proceedings) methods. Here, we hoped to build on surveys, most recently by Nelson and Brennan (2018), to understand, beyond what members say about themselves, what makes the CEEA-ACEG community distinct, where it has come from, and where it is going. Engaging in the two methods of data collection quickly diverted our focus from an analysis of the data themselves to the characteristics of the data in terms of cultural historical activity theory. Our preliminary findings point to some unique characteristics of machine- and human-driven results, with the former, as might be expected, focusing on the micro-level (words and language patterns) and the latter on the macro-level (ideas and concepts). NLP generated data within the realms of "community" and "division of labour" while the review of proceedings centred on "subject" and "object"; both found "instruments," although NLP did so with greater granularity. With this new understanding of the relative strengths of each method, we have a revised framework for addressing our original question.
