Automatically Identifying the Source Words of Lexical Blends in English

2010, Vol 36 (1), pp. 129-149
Author(s): Paul Cook, Suzanne Stevenson

Newly coined words pose problems for natural language processing systems because they are not in a system's lexicon, and therefore no lexical information is available for such words. A common way to form new words is lexical blending, as in cosmeceutical, a blend of cosmetic and pharmaceutical. We propose a statistical model for inferring a blend's source words, drawing on observed linguistic properties of blends; these properties are largely based on the recognizability of the source words in a blend. We annotate a set of 1,186 recently coined expressions, which includes 515 blends, and evaluate our methods on a 324-item subset. In this first study of novel blends, we achieve an accuracy of 40% on the task of inferring a blend's source words, which corresponds to a reduction in error rate of 39% over an informed baseline. We also give preliminary results showing that our features for source word identification can be used to distinguish blends from other kinds of novel words.
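
As a rough illustration of the candidate-generation step such a model needs (not the authors' statistical model; the lexicon and scoring below are invented), a blend can be split at every position and matched against lexicon words that share its prefix and suffix, with candidate pairs ranked by how much of each source word remains recognizable:

```python
# A toy sketch of candidate source-word generation for a blend, not the
# authors' statistical model: split the blend at every position, collect
# lexicon words sharing the prefix and suffix, and score pairs by the
# fraction of each source word that is recognizable in the blend.

LEXICON = {"cosmetic", "pharmaceutical", "cosmos", "medical"}  # illustrative

def candidate_source_words(blend, lexicon=LEXICON):
    candidates = []
    for i in range(1, len(blend)):
        prefix, suffix = blend[:i], blend[i:]
        heads = [w for w in lexicon if w.startswith(prefix) and w != blend]
        tails = [w for w in lexicon if w.endswith(suffix) and w != blend]
        for h in heads:
            for t in tails:
                # recognizability: how much of each source word survives in the blend
                score = len(prefix) / len(h) + len(suffix) / len(t)
                candidates.append((score, h, t))
    return sorted(candidates, reverse=True)

print(candidate_source_words("cosmeceutical")[:3])
```

On this toy lexicon the top-ranked pair for cosmeceutical is (cosmetic, pharmaceutical).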

2013, Vol 48, pp. 1-22
Author(s): M. Alabbas, A. Ramsay

Many natural language processing (NLP) applications require the computation of similarities between pairs of syntactic or semantic trees. Many researchers have used tree edit distance for this task, but this technique suffers from the drawback that it deals with single node operations only. We have extended the standard tree edit distance algorithm to deal with subtree transformation operations as well as single nodes. The extended algorithm with subtree operations, TED+ST, is more effective and flexible than the standard algorithm, especially for applications that pay attention to relations among nodes (e.g. in linguistic trees, deleting a modifier subtree should be cheaper than the sum of deleting its components individually). We describe the use of TED+ST for checking entailment between two Arabic text snippets. The preliminary results of using TED+ST were encouraging when compared with two string-based approaches and with the standard algorithm.
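
A minimal sketch of the cost intuition behind subtree operations, with assumed node and subtree costs rather than the costs actually used in TED+ST:

```python
# A minimal sketch (not the authors' TED+ST implementation) of why subtree
# operations matter: removing a whole modifier subtree in one step is charged
# less than removing each of its nodes with single-node operations.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)

DELETE_NODE_COST = 1.0
SUBTREE_DISCOUNT = 0.5   # assumed discount; the paper sets its own operation costs

def delete_nodes_cost(node):
    """Cost of deleting the subtree node by node with single-node operations."""
    return DELETE_NODE_COST * size(node)

def delete_subtree_cost(node):
    """Cost of deleting the subtree with one subtree operation."""
    return DELETE_NODE_COST + SUBTREE_DISCOUNT * (size(node) - 1)

# e.g. a modifier phrase under a noun phrase
modifier = Node("ADJP", [Node("very"), Node("tall")])
print(delete_nodes_cost(modifier))    # 3.0
print(delete_subtree_cost(modifier))  # 2.0 -- cheaper, as the abstract argues it should be
```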


Author(s): Sandeep Mathias, Diptesh Kanojia, Abhijit Mishra, Pushpak Bhattacharya

Gaze behaviour has been used as a way to gather cognitive information for a number of years. In this paper, we discuss the use of gaze behaviour for solving different tasks in natural language processing (NLP) without having to record it at test time, because collecting gaze behaviour is costly in terms of both time and money. Hence, in this paper, we focus on research done to alleviate the need for recording gaze behaviour at run time. We also describe eye-tracking corpora in multiple languages that are currently available and can be used in natural language processing. We conclude by discussing applications in one domain, education, and how learning gaze behaviour can help in solving the tasks of complex word identification and automatic essay grading.
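
One common way such work avoids eye tracking at test time (a general pattern in this literature, not a specific model from the paper) is to treat gaze features as an auxiliary prediction target during training; a minimal PyTorch sketch, with invented dimensions and a hypothetical per-token fixation-duration target:

```python
# A minimal multi-task sketch: a shared encoder is trained jointly on the
# target task and on predicting a gaze feature (e.g. fixation duration).
# At test time only the task head is used, so no eye tracker is needed.
# Architecture, sizes, and loss weighting are assumptions for illustration.

import torch
import torch.nn as nn

class GazeAuxiliaryModel(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.task_head = nn.Linear(hidden, num_classes)  # e.g. complex-word yes/no
        self.gaze_head = nn.Linear(hidden, 1)            # e.g. fixation duration per token

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.task_head(h[:, -1]), self.gaze_head(h).squeeze(-1)

model = GazeAuxiliaryModel()
tokens = torch.randint(0, 1000, (8, 20))                 # toy batch of token ids
task_logits, gaze_pred = model(tokens)
task_loss = nn.CrossEntropyLoss()(task_logits, torch.randint(0, 2, (8,)))
gaze_loss = nn.MSELoss()(gaze_pred, torch.rand(8, 20))   # gaze labels needed only in training
loss = task_loss + 0.5 * gaze_loss                        # weighted multi-task objective
```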


2011, Vol 474-476, pp. 460-465
Author(s): Bo Sun, Sheng Hui Huang, Xiao Hua Liu

An unknown word is a word that is not included in the sub-word vocabulary but must still be cut out by the word segmentation program. People's names, place names, and translated names are the major unknown words. Unknown Chinese words are a difficult problem in natural language processing and contribute to the low rate of correct segmentation. This paper introduces the finite multi-list method, which uses the word-forming capability of word fragments and their location in the word tree to process unknown Chinese words. In experiments, recall is 70.67% and the correct rate is 43.65%. These results show that unknown Chinese word identification based on the finite multi-list method is feasible.
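
As a toy illustration of scoring candidates by the word-forming behaviour of their fragments (a simplified stand-in for the finite multi-list method, with invented probabilities):

```python
# A toy illustration (not the paper's finite multi-list method) of the
# underlying idea: score a candidate unknown word by how often its characters
# are observed forming words in that position. Counts below are invented.

# position-tagged formation probabilities: P(char at begin / middle / end of a word)
FORMATION = {
    "张": {"begin": 0.70, "middle": 0.05, "end": 0.02},
    "伟": {"begin": 0.10, "middle": 0.20, "end": 0.55},
}

def word_formation_score(candidate):
    """Geometric-mean style score of a candidate unknown word."""
    score = 1.0
    for i, ch in enumerate(candidate):
        pos = "begin" if i == 0 else ("end" if i == len(candidate) - 1 else "middle")
        score *= FORMATION.get(ch, {}).get(pos, 0.01)  # smooth unseen characters
    return score ** (1.0 / len(candidate))

print(word_formation_score("张伟"))  # a plausible person name scores highly
```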


2019, pp. 1-11
Author(s): Jared C. Malke, Shida Jin, Samuel P. Camp, Bryan Lari, Trey Kell, ...

PURPOSE Medical records contain a wealth of useful, informative data points valuable for clinical research. Most data points are stored in semistructured or unstructured legacy documents and require manual data abstraction into a structured format to render the information more readily accessible, searchable, and generally analysis-ready. The substantial labor needed for this can be cost prohibitive, particularly when dealing with large patient cohorts. METHODS To establish a high-throughput approach to data abstraction, we developed a novel framework using natural language processing (NLP) and a decision-rules algorithm to extract, transform, and load (ETL) melanoma primary pathology features from pathology reports in an institutional legacy electronic medical record system into a structured database. We compared a subset of these data with a manually curated data set comprising the same patients and developed a novel scoring system to assess confidence in records generated by the algorithm, thus obviating manual review of high-confidence records while flagging specific, low-confidence records for review. RESULTS The algorithm generated 368,624 individual melanoma data points comprising 16 primary tumor prognostic factors and metadata from 23,039 patients. From these data points, a subset of 147,872 was compared with an existing, manually abstracted data set, demonstrating an exact or synonymous match for 90.4% of all data points. Additionally, the confidence-scoring algorithm demonstrated an error rate of only 3.7%. CONCLUSION Our NLP platform can identify and abstract melanoma primary prognostic factors with accuracy comparable to that of manual abstraction (< 5% error rate) and with vastly greater efficiency. Principles used in the development of this algorithm could be expanded to include other melanoma-specific data points as well as disease-agnostic fields and further enhance capture of essential elements from nonstructured data.
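
A minimal sketch of the extract-and-score pattern described here, with invented regular expressions, field names, and confidence thresholds; the institutional pipeline's actual decision rules are not reproduced:

```python
# A minimal sketch of extracting structured prognostic factors from free-text
# pathology reports and attaching a confidence score; patterns, weights, and
# the review threshold below are invented for illustration.

import re

PATTERNS = {
    "breslow_thickness_mm": re.compile(r"breslow(?:\s+thickness)?[:\s]+([\d.]+)\s*mm", re.I),
    "ulceration":           re.compile(r"ulceration[:\s]+(present|absent|identified|not identified)", re.I),
    "mitotic_rate":         re.compile(r"mitotic\s+(?:rate|figures?)[:\s]+([\d.]+)", re.I),
}

def extract_prognostic_factors(report_text):
    record = {}
    confidence = 0.0
    for field, pattern in PATTERNS.items():
        matches = pattern.findall(report_text)
        if len(matches) == 1:
            record[field] = matches[0]
            confidence += 1.0            # unambiguous single mention
        elif len(matches) > 1:
            record[field] = matches[0]
            confidence += 0.5            # conflicting mentions lower confidence
        # fields with no match add nothing to the confidence score
    record["_confidence"] = confidence / len(PATTERNS)
    record["_needs_review"] = record["_confidence"] < 0.8   # assumed review threshold
    return record

report = "Breslow thickness: 1.2 mm. Ulceration: not identified. Mitotic rate: 2 per mm2."
print(extract_prognostic_factors(report))
```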


2021, Vol 3 (4)
Author(s): Girma Yohannis Bade

This article reviews Natural Language Processing (NLP) and its challenges for the Omotic language groups. Many recent technological achievements are partially fuelled by developments in NLP. NLP is a component of artificial intelligence (AI) and offers companies the facility to analyze their business data. However, many challenges limit the effectiveness of NLP applications for the Omotic language groups (Ometo) of Ethiopia: word irregularity, stop word identification, compounding, and the languages' limited digital data resources. This study therefore opens the way for upcoming researchers to further investigate NLP applications for these language groups.


1995, Vol 34 (01/02), pp. 68-74
Author(s): E. Wehrli, R. Clark

Abstract: While the design of a fully general procedure for semantico-pragmatic interpretation of natural language texts does not seem to be feasible with current scientific knowledge and technology, the more practical micro-world based approaches lack generality and portability. A compromise between generality and practicality might lie in the use of an intermediate level of representation (“pseudo-semantics”), which can be derived from syntactic representations and lexical information by means of a general procedure. Domain-dependent rules for semantico-pragmatic interpretation can then be applied to these representations, insulating syntactic processing from details of the application domain.
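
A heavily simplified, hypothetical sketch of this two-stage architecture: a general procedure maps a parse to a pseudo-semantic frame, and domain-dependent rules then interpret that frame (the frames, rules, and domain below are invented for illustration):

```python
# A hypothetical sketch of the architecture described above, not the authors'
# system: a domain-independent step builds a "pseudo-semantic" frame from
# syntactic and lexical information, and domain rules interpret the frame.

def to_pseudo_semantics(parse):
    """General, domain-independent step: verb plus grammatical functions -> frame."""
    return {
        "predicate": parse["verb_lemma"],
        "agent": parse.get("subject"),
        "theme": parse.get("object"),
    }

# domain-dependent interpretation rules (here: a toy appointment domain)
DOMAIN_RULES = {
    "cancel": lambda f: ("CANCEL_APPOINTMENT", f["theme"]),
    "book":   lambda f: ("CREATE_APPOINTMENT", f["theme"]),
}

def interpret(parse):
    frame = to_pseudo_semantics(parse)
    rule = DOMAIN_RULES.get(frame["predicate"])
    return rule(frame) if rule else ("UNKNOWN", frame)

print(interpret({"verb_lemma": "cancel", "subject": "I", "object": "my appointment"}))
```

The point of the split is that only DOMAIN_RULES would need to change when the application domain changes; the syntactic processing and the frame-building step stay fixed.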


1998, Vol 4 (4), pp. 309-324
Author(s): YUAN YAO, KIM TEN LUA

Currently, word tokenization and segmentation are still a hot topic in natural language processing, especially for languages like Chinese in which there is no blank space for word delimitation. Three major problems are faced: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionaries and new words; and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods have not dealt with these problems together. To tackle the three problems at once, this paper presents a novel dictionary-based method called the Splitting-Merging Model (SMM) for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and non-boundaries of Chinese words, and finally produces a word segmentation by resolving ambiguities and detecting new words.
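
A toy sketch of the splitting step in the spirit of SMM (not the authors' implementation): pointwise mutual information between adjacent characters is estimated from a corpus, and a boundary is proposed wherever the association is weak. The corpus and threshold below are illustrative:

```python
# A toy sketch of mutual-information-based splitting for Chinese segmentation,
# not the SMM implementation itself: strong character associations are merged
# into words, weak ones become word boundaries.

import math
from collections import Counter

corpus = "我爱北京天安门天安门上太阳升"   # toy corpus; a real system uses far more text
unigrams = Counter(corpus)
bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
N = len(corpus)

def pmi(a, b):
    """Pointwise mutual information of the adjacent character pair (a, b)."""
    p_ab = bigrams[a + b] / (N - 1)
    p_a, p_b = unigrams[a] / N, unigrams[b] / N
    return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

def segment(sentence, threshold=0.5):
    """Insert a boundary between characters whose PMI falls below the threshold."""
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if pmi(a, b) >= threshold:
            current += b           # strong association: keep merging
        else:
            words.append(current)  # weak association: split here
            current = b
    words.append(current)
    return words

print(segment("我爱天安门"))  # -> ['我爱', '天安门'] with this toy corpus
```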


Author(s): Wided Bakari, Patrice Bellot, Mahmoud Neji
With the development of electronic media and the heterogeneity of Arabic data on the Web, the idea of building a clean corpus for certain applications of natural language processing, including machine translation, information retrieval, and question answering, becomes more and more pressing. In this manuscript, we seek to create and develop our own corpus of question-text pairs, which will then provide a better base for our experimentation step. We model its construction by a method for Arabic that recovers texts from the Web that could prove to be answers to our factual questions. To do this, we developed a Java script that extracts a list of HTML pages for a given query; we then clean these pages to obtain a database of texts and a corpus of question-text pairs. In addition, we give preliminary results of our proposed method. Some investigations into the construction of Arabic corpora are also presented in this document.
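
A Python sketch of the same retrieve-and-clean pipeline (the authors used a Java script); the URL list, question, and cleaning steps below are placeholders:

```python
# A sketch of building question-text pairs from retrieved web pages; the URLs
# would come from submitting the question to a search engine, which is not
# shown here. Question and URL below are placeholders.

import requests
from bs4 import BeautifulSoup

def fetch_and_clean(url):
    """Download one HTML page and strip it down to plain text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()                       # drop non-content markup
    return " ".join(soup.get_text(separator=" ").split())

def build_question_text_pairs(question, urls):
    """Pair a factual question with the cleaned text of each retrieved page."""
    return [(question, fetch_and_clean(u)) for u in urls]

pairs = build_question_text_pairs("متى تأسست جامعة القاهرة؟", ["https://example.com/page1"])
```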

