Feature-based natural language processing for GSL synthesis

The work reported in this study is based on research that has been carried out while developing a sign synthesis system for Greek Sign Language (GSL): theoretical linguistic analysis as well as lexicon and grammar resources derived from this analysis. We focus on the organisation of linguistic knowledge that initiates the multi-functional processing required to achieve sign generation performed by a virtual signer. In this context, structure rules and lexical coding support sign synthesis of GSL utterances, by exploitation of avatar technologies for the representation of the linguistic message. Sign generation involves two subsystems: a Greek-to-GSL conversion subsystem and a sign performance subsystem. The conversion subsystem matches input strings of written Greek-to-GSL structure patterns, exploiting Natural Language Processing (NLP) mechanisms. The sign performance subsystem uses parsed output of GSL structure patterns, enriched with sign-specific information, to activate a virtual signer for the performance of properly coded linguistic messages. Both the conversion and the synthesis procedure are based on adequately constructed electronic linguistic resources. Applicability of sign synthesis is demonstrated with the example of a Web-based prototype environment for GSL grammar teaching.

Download Full-text

Building natural language processing tools for Runyakitara

Applied Linguistics Review ◽

10.1515/applirev-2020-2004 ◽

2020 ◽

Vol 0 (0) ◽

Author(s):

Fridah Katushemererwe ◽

Andrew Caines ◽

Paula Buttery

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Learning ◽

Language Processing ◽

Primary Data ◽

Computer Assisted ◽

Endangered Languages ◽

Test Case ◽

Short Supply ◽

Linguistic Resources

AbstractThis paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. First therefore, we need to collect corpora for these languages, before we can proceed to the design of a spell-checker, grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, we outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely-related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.

Download Full-text

Extension of Semantic Based Urdu Linguistic Resources Using Natural Language Processing

2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress(DASC/PiCom/DataCom/CyberSciTech) ◽

10.1109/dasc-picom-datacom-cyberscitec.2017.214 ◽

2017 ◽

Cited By ~ 4

Author(s):

Komal Khalid ◽

Hammad Afzal ◽

Faiza Moqaddas ◽

Naima Iltaf ◽

Ahmed Muqeem Sheri ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Linguistic Resources

Download Full-text

Exploiting linguistic knowledge for statistical natural language processing

10.5353/th_b4650629 ◽

2011 ◽

Author(s):

Lidan Zhang

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Linguistic Knowledge ◽

Statistical Natural Language Processing

Download Full-text

Alternative Ways for Knowledge Collection, Indexing and Robust Language Retrieval

Methods of Information in Medicine ◽

10.1055/s-0038-1634563 ◽

1998 ◽

Vol 37 (04/05) ◽

pp. 315-326 ◽

Cited By ~ 3

Author(s):

C. Lovis ◽

A.-M. Rassinoux ◽

J.-R. Scherrer ◽

R. H. Baud

Keyword(s):

Natural Language Processing ◽

Knowledge Representation ◽

Natural Language ◽

Language Processing ◽

Building Blocks ◽

Linguistic Knowledge ◽

Recent Experience ◽

Medical Texts ◽

Natural Components

AbstractDefinitions are provided of the key entities in knowledge representation for Natural Language Processing (NLP). Starting from the words, which are the natural components of any sentence, both the role of expressions and the decomposition of words into their parts are emphasized. This leads to the notion of concepts, which are either primitive or composite depending on the model where they are created. The problem of finding the most adequate degree of granularity for a concept is studied. From this reflection on basic Natural Language Processing components, four categories of linguistic knowledge are recognized, that are considered to be the building blocks of a Medical Linguistic Knowledge Base (MLKB). Following on the tracks of a recent experience in building a natural language-based patient encoding browser, a robust method for conceptual indexing and query of medical texts is presented with particular attention to the scheme of knowledge representation.

Download Full-text

FindThatQuote: A Question-Answering Web-based System to Locate Quotes using Deep Learning and Natural-Language Processing

10.5121/csit.2021.110909 ◽

2021 ◽

Author(s):

Nathan Ji ◽

Yu Sun

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Success Rate ◽

Language Processing ◽

Question Answering ◽

Specific Information ◽

Web Based ◽

User Friendly ◽

Web Based System

The digital age gives us access to a multitude of both information and mediums in which we can interpret information. A majority of the time, many people find interpreting such information difficult as the medium may not be as user friendly as possible. This project has examined the inquiry of how one can identify specific information in a given text based on a question. This inquiry is intended to streamline one's ability to determine the relevance of a given text relative to his objective. The project has an overall 80% success rate given 10 articles with three questions asked per article. This success rate indicates that this project is likely applicable to those who are asking for content level questions within an article.

Download Full-text

RussianLanguage Thesauri: Automated Construction and Application For Natural Language Processing Tasks

Modeling and Analysis of Information Systems ◽

10.18255/1818-1015-2018-4-435-458 ◽

2018 ◽

Vol 25 (4) ◽

pp. 435-458

Author(s):

Nadezhda S. Lagutina ◽

Ksenia V. Lagutina ◽

Aleksey S. Adrianov ◽

Ilya V. Paramonov

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Russian Language ◽

Labor Costs ◽

Linguistic Resources ◽

Advantages And Disadvantages ◽

Text Corpora ◽

Linguistic Methods

The paper reviews the existing Russian-language thesauri in digital form and methods of their automatic construction and application. The authors analyzed the main characteristics of open access thesauri for scientific research, evaluated trends of their development, and their effectiveness in solving natural language processing tasks. The statistical and linguistic methods of thesaurus construction that allow to automate the development and reduce labor costs of expert linguists were studied. In particular, the authors considered algorithms for extracting keywords and semantic thesaurus relationships of all types, as well as the quality of thesauri generated with the use of these tools. To illustrate features of various methods for constructing thesaurus relationships, the authors developed a combined method that generates a specialized thesaurus fully automatically taking into account a text corpus in a particular domain and several existing linguistic resources. With the proposed method, experiments were conducted with two Russian-language text corpora from two subject areas: articles about migrants and tweets. The resulting thesauri were assessed by using an integrated assessment developed in the previous authors’ study that allows to analyze various aspects of the thesaurus and the quality of the generation methods. The analysis revealed the main advantages and disadvantages of various approaches to the construction of thesauri and the extraction of semantic relationships of different types, as well as made it possible to determine directions for future study.

Download Full-text

Automatic Multilingual Stopwords Identification from Very Small Corpora

Electronics ◽

10.3390/electronics10172169 ◽

2021 ◽

Vol 10 (17) ◽

pp. 2169

Author(s):

Stefano Ferilli

Keyword(s):

Natural Language Processing ◽

Language Processing ◽

State Of The Art ◽

Linguistic Knowledge ◽

Linguistic Resources ◽

Novel Approach ◽

Document Frequency ◽

Critical Problems ◽

Fully Automatic ◽

Local Languages

Tools for Natural Language Processing work using linguistic resources, that are language-specific. The complexity of building such resources causes many languages to lack them. So, learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora, which are not available for many local languages and jargons, lacking a wide literature. This paper focuses on stopwords, i.e., terms in a text which do not contribute in conveying its topic or content. It provides two main, inter-related and complementary, methodological contributions: (i) it proposes a novel approach based on term and document frequency to rank candidate stopwords, that works also on very small corpora (even single documents); and (ii) it proposes an automatic cutoff strategy to select the best candidates in the ranking, thus addressing one of the most critical problems in the stopword identification practice. Nice features of these approaches are that (i) they are generic and applicable to different languages, (ii) they are fully automatic, and (iii) they do not require any previous linguistic knowledge. Extensive experiments show that both are extremely effective and reliable. The former outperforms all comparable approaches in the state-of-the-art, both in terms of performance (Precision stays at 100% or nearly so for a large portion of the top-ranked candidate stopwords, while Recall is quite close to the maximum reachable in theory.) and in smooth behavior (Precision is monotonically decreasing, and Recall is monotonically increasing, allowing the experimenter to choose the preferred balance.). The latter is more flexible than existing solutions in the literature, requiring just one parameter intuitively related to the balance between Precision and Recall one wishes to obtain.

Download Full-text

Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system

BMJ Open ◽

10.1136/bmjopen-2018-023232 ◽

2019 ◽

Vol 9 (4) ◽

pp. e023232 ◽

Cited By ~ 7

Author(s):

Beata Fonferko-Shadrach ◽

Arron S Lacey ◽

Angus Roberts ◽

Ashley Akbari ◽

Simon Thompson ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Seizure Frequency ◽

Extraction System ◽

Health Board ◽

Free Text ◽

Specific Information ◽

Clinical Text ◽

Routinely Collected Data

ObjectiveRoutinely collected healthcare data are a powerful research resource but often lack detailed disease-specific information that is collected in clinical free text, for example, clinic letters. We aim to use natural language processing techniques to extract detailed clinical information from epilepsy clinic letters to enrich routinely collected data.DesignWe used the general architecture for text engineering (GATE) framework to build an information extraction system, ExECT (extraction of epilepsy clinical text), combining rule-based and statistical techniques. We extracted nine categories of epilepsy information in addition to clinic date and date of birth across 200 clinic letters. We compared the results of our algorithm with a manual review of the letters by an epilepsy clinician.SettingDe-identified and pseudonymised epilepsy clinic letters from a Health Board serving half a million residents in Wales, UK.ResultsWe identified 1925 items of information with overall precision, recall and F1 score of 91.4%, 81.4% and 86.1%, respectively. Precision and recall for epilepsy-specific categories were: epilepsy diagnosis (88.1%, 89.0%), epilepsy type (89.8%, 79.8%), focal seizures (96.2%, 69.7%), generalised seizures (88.8%, 52.3%), seizure frequency (86.3%–53.6%), medication (96.1%, 94.0%), CT (55.6%, 58.8%), MRI (82.4%, 68.8%) and electroencephalogram (81.5%, 75.3%).ConclusionsWe have built an automated clinical text extraction system that can accurately extract epilepsy information from free text in clinic letters. This can enhance routinely collected data for research in the UK. The information extracted with ExECT such as epilepsy type, seizure frequency and neurological investigations are often missing from routinely collected data. We propose that our algorithm can bridge this data gap enabling further epilepsy research opportunities. While many of the rules in our pipeline were tailored to extract epilepsy specific information, our methods can be applied to other diseases and also can be used in clinical practice to record patient information in a structured manner.

Download Full-text

Mapping the plague through natural language processing

10.1101/2021.04.27.21256212 ◽

2021 ◽

Author(s):

Fabienne Krauer ◽

Boris V. Schmid

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Machine Learning Algorithms ◽

Specific Information ◽

Location Data ◽

Quantitative Analyses ◽

Overall Performance ◽

Spatio Temporal ◽

Temporal Extent

AbstractPlague has caused three major pandemics with millions of casualties in the past centuries. There is a substantial amount of historical and modern primary and secondary literature about the spatial and temporal extent of epidemics, circumstances of transmission or symptoms and treatments. Many quantitative analyses rely on structured data, but the extraction of specific information such as the time and place of outbreaks is a tedious process. Machine learning algorithms for natural language processing (NLP) can potentially facilitate the establishment of datasets, but their use in plague research has not been explored much yet. We investigated the performance of five pre-trained NLP libraries (Google NLP, Stanford CoreNLP, spaCy, germaNER and Geoparser.io) for the extraction of location data from a German plague treatise published in 1908 compared to the gold standard of manual annotation. Of all tested algorithms, we found that Stanford CoreNLP had the best overall performance but spaCy showed the highest sensitivity. Moreover, we demonstrate how word associations can be extracted and displayed with simple text mining techniques in order to gain a quick insight into salient topics. Finally, we compared our newly digitised plague dataset to a re-digitised version of the famous Biraben plague list and update the spatio-temporal extent of the second pandemic plague mentions. We conclude that all NLP tools have their limitations, but they are potentially useful to accelerate the collection of data and the generation of a global plague outbreak database.

Download Full-text

Deep-Learning-Based Natural Language Processing of Serial Free-Text Radiological Reports for Predicting Rectal Cancer Patient Survival

Frontiers in Oncology ◽

10.3389/fonc.2021.747250 ◽

2021 ◽

Vol 11 ◽

Author(s):

Sunkyu Kim ◽

Choong-kun Lee ◽

Yonghwa Choi ◽

Eun Sil Baek ◽

Jeong Eun Choi ◽

...

Keyword(s):

Rectal Cancer ◽

Overall Survival ◽

Natural Language Processing ◽

Natural Language ◽

Cancer Patients ◽

Transfer Learning ◽

Language Processing ◽

Rectal Cancer Patient ◽

Linguistic Knowledge ◽

Free Text

Most electronic medical records, such as free-text radiological reports, are unstructured; however, the methodological approaches to analyzing these accumulating unstructured records are limited. This article proposes a deep-transfer-learning-based natural language processing model that analyzes serial magnetic resonance imaging reports of rectal cancer patients and predicts their overall survival. To evaluate the model, a retrospective cohort study of 4,338 rectal cancer patients was conducted. The experimental results revealed that the proposed model utilizing pre-trained clinical linguistic knowledge could predict the overall survival of patients without any structured information and was superior to the carcinoembryonic antigen in predicting survival. The deep-transfer-learning model using free-text radiological reports can predict the survival of patients with rectal cancer, thereby increasing the utility of unstructured medical big data.

Download Full-text