Automatically Identifying the Source Words of Lexical Blends in English

2010, Vol 36 (1), pp. 129-149
Author(s): Paul Cook, Suzanne Stevenson

Newly coined words pose problems for natural language processing systems because they are not in a system's lexicon, and therefore no lexical information is available for such words. A common way to form new words is lexical blending, as in cosmeceutical, a blend of cosmetic and pharmaceutical. We propose a statistical model for inferring a blend's source words, drawing on observed linguistic properties of blends; these properties are largely based on the recognizability of the source words in a blend. We annotate a set of 1,186 recently coined expressions, which includes 515 blends, and evaluate our methods on a 324-item subset. In this first study of novel blends, we achieve an accuracy of 40% on the task of inferring a blend's source words, which corresponds to a reduction in error rate of 39% over an informed baseline. We also give preliminary results showing that our features for source word identification can be used to distinguish blends from other kinds of novel words.
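
As a rough illustration of the candidate-generation step such a model needs (not the authors' statistical model; the lexicon and scoring below are invented), a blend can be split at every position and matched against lexicon words that share its prefix and suffix, with candidate pairs ranked by how much of each source word remains recognizable:

```python
# A toy sketch of candidate source-word generation for a blend, not the
# authors' statistical model: split the blend at every position, collect
# lexicon words sharing the prefix and suffix, and score pairs by the
# fraction of each source word that is recognizable in the blend.

LEXICON = {"cosmetic", "pharmaceutical", "cosmos", "medical"}  # illustrative

def candidate_source_words(blend, lexicon=LEXICON):
    candidates = []
    for i in range(1, len(blend)):
        prefix, suffix = blend[:i], blend[i:]
        heads = [w for w in lexicon if w.startswith(prefix) and w != blend]
        tails = [w for w in lexicon if w.endswith(suffix) and w != blend]
        for h in heads:
            for t in tails:
                # recognizability: how much of each source word survives in the blend
                score = len(prefix) / len(h) + len(suffix) / len(t)
                candidates.append((score, h, t))
    return sorted(candidates, reverse=True)

print(candidate_source_words("cosmeceutical")[:3])
```

On this toy lexicon the top-ranked pair for cosmeceutical is (cosmetic, pharmaceutical).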

2013, Vol 48, pp. 1-22
Author(s): M. Alabbas, A. Ramsay

Many natural language processing (NLP) applications require the computation of similarities between pairs of syntactic or semantic trees. Many researchers have used tree edit distance for this task, but this technique suffers from the drawback that it deals with single node operations only. We have extended the standard tree edit distance algorithm to deal with subtree transformation operations as well as single nodes. The extended algorithm with subtree operations, TED+ST, is more effective and flexible than the standard algorithm, especially for applications that pay attention to relations among nodes (e.g. in linguistic trees, deleting a modifier subtree should be cheaper than the sum of deleting its components individually). We describe the use of TED+ST for checking entailment between two Arabic text snippets. The preliminary results of using TED+ST were encouraging when compared with two string-based approaches and with the standard algorithm.
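
A minimal sketch of the cost intuition behind subtree operations, with assumed node and subtree costs rather than the costs actually used in TED+ST:

```python
# A minimal sketch (not the authors' TED+ST implementation) of why subtree
# operations matter: removing a whole modifier subtree in one step is charged
# less than removing each of its nodes with single-node operations.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)

DELETE_NODE_COST = 1.0
SUBTREE_DISCOUNT = 0.5   # assumed discount; the paper sets its own operation costs

def delete_nodes_cost(node):
    """Cost of deleting the subtree node by node with single-node operations."""
    return DELETE_NODE_COST * size(node)

def delete_subtree_cost(node):
    """Cost of deleting the subtree with one subtree operation."""
    return DELETE_NODE_COST + SUBTREE_DISCOUNT * (size(node) - 1)

# e.g. a modifier phrase under a noun phrase
modifier = Node("ADJP", [Node("very"), Node("tall")])
print(delete_nodes_cost(modifier))    # 3.0
print(delete_subtree_cost(modifier))  # 2.0 -- cheaper, as the abstract argues it should be
```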


Author(s): Sandeep Mathias, Diptesh Kanojia, Abhijit Mishra, Pushpak Bhattacharya

Gaze behaviour has been used as a way to gather cognitive information for a number of years. In this paper, we discuss the use of gaze behaviour for solving different tasks in natural language processing (NLP) without having to record it at test time, because collecting gaze behaviour is costly in terms of both time and money. Hence, in this paper, we focus on research done to alleviate the need for recording gaze behaviour at run time. We also describe eye-tracking corpora in multiple languages that are currently available and can be used in natural language processing. We conclude by discussing applications in one domain, education, and how learning gaze behaviour can help in solving the tasks of complex word identification and automatic essay grading.
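
One common way such work avoids eye tracking at test time (a general pattern in this literature, not a specific model from the paper) is to treat gaze features as an auxiliary prediction target during training; a minimal PyTorch sketch, with invented dimensions and a hypothetical per-token fixation-duration target:

```python
# A minimal multi-task sketch: a shared encoder is trained jointly on the
# target task and on predicting a gaze feature (e.g. fixation duration).
# At test time only the task head is used, so no eye tracker is needed.
# Architecture, sizes, and loss weighting are assumptions for illustration.

import torch
import torch.nn as nn

class GazeAuxiliaryModel(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.task_head = nn.Linear(hidden, num_classes)  # e.g. complex-word yes/no
        self.gaze_head = nn.Linear(hidden, 1)            # e.g. fixation duration per token

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.task_head(h[:, -1]), self.gaze_head(h).squeeze(-1)

model = GazeAuxiliaryModel()
tokens = torch.randint(0, 1000, (8, 20))                 # toy batch of token ids
task_logits, gaze_pred = model(tokens)
task_loss = nn.CrossEntropyLoss()(task_logits, torch.randint(0, 2, (8,)))
gaze_loss = nn.MSELoss()(gaze_pred, torch.rand(8, 20))   # gaze labels needed only in training
loss = task_loss + 0.5 * gaze_loss                        # weighted multi-task objective
```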


2011, Vol 474-476, pp. 460-465
Author(s): Bo Sun, Sheng Hui Huang, Xiao Hua Liu

An unknown word is a word that is not included in the sub-word vocabulary but must still be cut out by the word segmentation program. People's names, place names, and translated names are the major unknown words. Unknown Chinese words are a difficult problem in natural language processing and contribute to the low rate of correct segmentation. This paper introduces the finite multi-list method, which uses the word-forming capability of word fragments and their location in the word tree to process unknown Chinese words. In experiments, recall is 70.67% and the correct rate is 43.65%. These results show that unknown Chinese word identification based on the finite multi-list method is feasible.
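
As a toy illustration of scoring candidates by the word-forming behaviour of their fragments (a simplified stand-in for the finite multi-list method, with invented probabilities):

```python
# A toy illustration (not the paper's finite multi-list method) of the
# underlying idea: score a candidate unknown word by how often its characters
# are observed forming words in that position. Counts below are invented.

# position-tagged formation probabilities: P(char at begin / middle / end of a word)
FORMATION = {
    "张": {"begin": 0.70, "middle": 0.05, "end": 0.02},
    "伟": {"begin": 0.10, "middle": 0.20, "end": 0.55},
}

def word_formation_score(candidate):
    """Geometric-mean style score of a candidate unknown word."""
    score = 1.0
    for i, ch in enumerate(candidate):
        pos = "begin" if i == 0 else ("end" if i == len(candidate) - 1 else "middle")
        score *= FORMATION.get(ch, {}).get(pos, 0.01)  # smooth unseen characters
    return score ** (1.0 / len(candidate))

print(word_formation_score("张伟"))  # a plausible person name scores highly
```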


2019, pp. 1-11
Author(s): Jared C. Malke, Shida Jin, Samuel P. Camp, Bryan Lari, Trey Kell, ...

PURPOSE Medical records contain a wealth of useful, informative data points valuable for clinical research. Most data points are stored in semistructured or unstructured legacy documents and require manual data abstraction into a structured format to render the information more readily accessible, searchable, and generally analysis-ready. The substantial labor needed for this can be cost prohibitive, particularly when dealing with large patient cohorts. METHODS To establish a high-throughput approach to data abstraction, we developed a novel framework using natural language processing (NLP) and a decision-rules algorithm to extract, transform, and load (ETL) melanoma primary pathology features from pathology reports in an institutional legacy electronic medical record system into a structured database. We compared a subset of these data with a manually curated data set comprising the same patients and developed a novel scoring system to assess confidence in records generated by the algorithm, thus obviating manual review of high-confidence records while flagging specific, low-confidence records for review. RESULTS The algorithm generated 368,624 individual melanoma data points comprising 16 primary tumor prognostic factors and metadata from 23,039 patients. From these data points, a subset of 147,872 was compared with an existing, manually abstracted data set, demonstrating an exact or synonymous match for 90.4% of all data points. Additionally, the confidence-scoring algorithm demonstrated an error rate of only 3.7%. CONCLUSION Our NLP platform can identify and abstract melanoma primary prognostic factors with accuracy comparable to that of manual abstraction (< 5% error rate) and with vastly greater efficiency. Principles used in the development of this algorithm could be expanded to include other melanoma-specific data points as well as disease-agnostic fields and further enhance capture of essential elements from nonstructured data.
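
A minimal sketch of the extract-and-score pattern described here, with invented regular expressions, field names, and confidence thresholds; the institutional pipeline's actual decision rules are not reproduced:

```python
# A minimal sketch of extracting structured prognostic factors from free-text
# pathology reports and attaching a confidence score; patterns, weights, and
# the review threshold below are invented for illustration.

import re

PATTERNS = {
    "breslow_thickness_mm": re.compile(r"breslow(?:\s+thickness)?[:\s]+([\d.]+)\s*mm", re.I),
    "ulceration":           re.compile(r"ulceration[:\s]+(present|absent|identified|not identified)", re.I),
    "mitotic_rate":         re.compile(r"mitotic\s+(?:rate|figures?)[:\s]+([\d.]+)", re.I),
}

def extract_prognostic_factors(report_text):
    record = {}
    confidence = 0.0
    for field, pattern in PATTERNS.items():
        matches = pattern.findall(report_text)
        if len(matches) == 1:
            record[field] = matches[0]
            confidence += 1.0            # unambiguous single mention
        elif len(matches) > 1:
            record[field] = matches[0]
            confidence += 0.5            # conflicting mentions lower confidence
        # fields with no match add nothing to the confidence score
    record["_confidence"] = confidence / len(PATTERNS)
    record["_needs_review"] = record["_confidence"] < 0.8   # assumed review threshold
    return record

report = "Breslow thickness: 1.2 mm. Ulceration: not identified. Mitotic rate: 2 per mm2."
print(extract_prognostic_factors(report))
```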


2021, Vol 3 (4)
Author(s): Girma Yohannis Bade

This article reviews Natural Language Processing (NLP) and its challenges for the Omotic language groups. Many recent technological achievements are partially fuelled by developments in NLP. NLP is a component of artificial intelligence (AI) and offers companies the facility to analyze their business data. However, many challenges limit the effectiveness of NLP applications for the Omotic language groups (Ometo) of Ethiopia: word irregularity, stop word identification, compounding, and the languages' limited digital data resources. This study therefore opens the way for upcoming researchers to further investigate NLP applications for these language groups.


1995, Vol 34 (01/02), pp. 68-74
Author(s): E. Wehrli, R. Clark

Abstract: While the design of a fully general procedure for semantico-pragmatic interpretation of natural language texts does not seem to be feasible with current scientific knowledge and technology, the more practical micro-world based approaches lack generality and portability. A compromise between generality and practicality might lie in the use of an intermediate level of representation (“pseudo-semantics”), which can be derived from syntactic representations and lexical information by means of a general procedure. Domain-dependent rules for semantico-pragmatic interpretation can then be applied to these representations, insulating syntactic processing from details of the application domain.
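
A heavily simplified, hypothetical sketch of this two-stage architecture: a general procedure maps a parse to a pseudo-semantic frame, and domain-dependent rules then interpret that frame (the frames, rules, and domain below are invented for illustration):

```python
# A hypothetical sketch of the architecture described above, not the authors'
# system: a domain-independent step builds a "pseudo-semantic" frame from
# syntactic and lexical information, and domain rules interpret the frame.

def to_pseudo_semantics(parse):
    """General, domain-independent step: verb plus grammatical functions -> frame."""
    return {
        "predicate": parse["verb_lemma"],
        "agent": parse.get("subject"),
        "theme": parse.get("object"),
    }

# domain-dependent interpretation rules (here: a toy appointment domain)
DOMAIN_RULES = {
    "cancel": lambda f: ("CANCEL_APPOINTMENT", f["theme"]),
    "book":   lambda f: ("CREATE_APPOINTMENT", f["theme"]),
}

def interpret(parse):
    frame = to_pseudo_semantics(parse)
    rule = DOMAIN_RULES.get(frame["predicate"])
    return rule(frame) if rule else ("UNKNOWN", frame)

print(interpret({"verb_lemma": "cancel", "subject": "I", "object": "my appointment"}))
```

The point of the split is that only DOMAIN_RULES would need to change when the application domain changes; the syntactic processing and the frame-building step stay fixed.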


1998, Vol 4 (4), pp. 309-324
Author(s): YUAN YAO, KIM TEN LUA

Currently, word tokenization and segmentation are still a hot topic in natural language processing, especially for languages like Chinese in which there is no blank space for word delimitation. Three major problems are faced: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionaries and new words; and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods have not dealt with these problems together. To tackle the three problems at once, this paper presents a novel dictionary-based method called the Splitting-Merging Model (SMM) for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and non-boundaries of Chinese words, and finally produces a word segmentation by resolving ambiguities and detecting new words.
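
A toy sketch of the splitting step in the spirit of SMM (not the authors' implementation): pointwise mutual information between adjacent characters is estimated from a corpus, and a boundary is proposed wherever the association is weak. The corpus and threshold below are illustrative:

```python
# A toy sketch of mutual-information-based splitting for Chinese segmentation,
# not the SMM implementation itself: strong character associations are merged
# into words, weak ones become word boundaries.

import math
from collections import Counter

corpus = "我爱北京天安门天安门上太阳升"   # toy corpus; a real system uses far more text
unigrams = Counter(corpus)
bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
N = len(corpus)

def pmi(a, b):
    """Pointwise mutual information of the adjacent character pair (a, b)."""
    p_ab = bigrams[a + b] / (N - 1)
    p_a, p_b = unigrams[a] / N, unigrams[b] / N
    return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

def segment(sentence, threshold=0.5):
    """Insert a boundary between characters whose PMI falls below the threshold."""
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if pmi(a, b) >= threshold:
            current += b           # strong association: keep merging
        else:
            words.append(current)  # weak association: split here
            current = b
    words.append(current)
    return words

print(segment("我爱天安门"))  # -> ['我爱', '天安门'] with this toy corpus
```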


Author(s): Wided Bakari, Patrice Bellot, Mahmoud Neji
With the development of electronic media and the heterogeneity of Arabic data on the Web, the idea of building a clean corpus for certain applications of natural language processing, including machine translation, information retrieval, and question answering, becomes more and more pressing. In this manuscript, we seek to create and develop our own corpus of question-text pairs, which will then provide a better base for our experimentation step. We model its construction by a method for Arabic that recovers texts from the Web that could prove to be answers to our factual questions. To do this, we developed a Java script that extracts a list of HTML pages for a given query; we then clean these pages to obtain a database of texts and a corpus of question-text pairs. In addition, we give preliminary results of our proposed method. Some investigations into the construction of Arabic corpora are also presented in this document.
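
A Python sketch of the same retrieve-and-clean pipeline (the authors used a Java script); the URL list, question, and cleaning steps below are placeholders:

```python
# A sketch of building question-text pairs from retrieved web pages; the URLs
# would come from submitting the question to a search engine, which is not
# shown here. Question and URL below are placeholders.

import requests
from bs4 import BeautifulSoup

def fetch_and_clean(url):
    """Download one HTML page and strip it down to plain text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()                       # drop non-content markup
    return " ".join(soup.get_text(separator=" ").split())

def build_question_text_pairs(question, urls):
    """Pair a factual question with the cleaned text of each retrieved page."""
    return [(question, fetch_and_clean(u)) for u in urls]

pairs = build_question_text_pairs("متى تأسست جامعة القاهرة؟", ["https://example.com/page1"])
```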

