Arabic Poem Generation Incorporating Deep Learning and Phonetic CNNsubword Embedding Models

Author(s):  
Sameerah Talafha ◽  
Banafsheh Rekabdar

Arabic poetry generation is a challenging task because the linguistic structure of Arabic poses severe difficulties for researchers and developers in the Natural Language Processing (NLP) field. In this paper, we propose a poetry generation model with extended phonetic and semantic embeddings (Phonetic CNNsubword embeddings). We show that Phonetic CNNsubword embeddings contribute more effectively to overall model performance than FastTextsubword embeddings. Our poetry generation model follows a two-stage approach: (1) generating the first verse, which explicitly incorporates the theme-related phrase, and (2) generating the remaining verses with the proposed Hierarchy-Attention Sequence-to-Sequence model (HAS2S), which adequately captures word, phrase, and verse information between contexts. A comprehensive human evaluation confirms that the poems generated by our model outperform those of the base models on criteria such as Meaning, Coherence, Fluency, and Poeticness. Extensive quantitative experiments using Bi-Lingual Evaluation Understudy (BLEU) scores also demonstrate significant improvements over strong baselines.
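
The abstract reports BLEU scores without giving the exact configuration, so the following is only a minimal sketch of how a sentence-level BLEU score can be computed against a single reference. The function names, the uniform n-gram weights, and the default `max_n=2` are illustrative assumptions, not the authors' evaluation setup.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions for n = 1..max_n,
    multiplied by a brevity penalty (single reference, no smoothing)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)
```

A generated verse and a reference verse would each be passed as a list of tokens, for example `bleu(generated.split(), reference.split())`. Because no smoothing is applied, any candidate with no matching n-gram at some order scores zero.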

Pain Medicine ◽  
2020 ◽  
Vol 21 (11) ◽  
pp. 3133-3160
Author(s):  
Patrick J Tighe ◽  
Bharadwaj Sannapaneni ◽  
Roger B Fillingim ◽  
Charlie Doyle ◽  
Michael Kent ◽  
...  

Abstract Objective Recent efforts to update the definitions and taxonomic structure of concepts related to pain have revealed opportunities to better quantify topics of existing pain research subject areas. Methods Here, we apply basic natural language processing (NLP) analyses on a corpus of >200,000 abstracts published on PubMed under the medical subject heading (MeSH) of “pain” to quantify the topics, content, and themes on pain-related research dating back to the 1940s. Results The most common stemmed terms included “pain” (601,122 occurrences), “patient” (508,064 occurrences), and “studi-” (208,839 occurrences). Contrarily, terms with the highest term frequency–inverse document frequency included “tmd” (6.21), “qol” (6.01), and “endometriosis” (5.94). Using the vector-embedded model of term definitions available via the “word2vec” technique, the most similar terms to “pain” included “discomfort,” “symptom,” and “pain-related.” For the term “acute,” the most similar terms in the word2vec vector space included “nonspecific,” “vaso-occlusive,” and “subacute”; for the term “chronic,” the most similar terms included “persistent,” “longstanding,” and “long-standing.” Topic modeling via Latent Dirichlet analysis identified peak coherence (0.49) at 40 topics. Network analysis of these topic models identified three topics that were outliers from the core cluster, two of which pertained to women’s health and obstetrics and were closely connected to one another, yet considered distant from the third outlier pertaining to age. A deep learning–based gated recurrent units abstract generation model successfully synthesized several unique abstracts with varying levels of believability, with special attention and some confusion at lower temperatures to the roles of placebo in randomized controlled trials. Conclusions Quantitative NLP models of published abstracts pertaining to pain may point to trends and gaps within pain research communities.
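
The term frequency–inverse document frequency ranking mentioned in the Results can be illustrated with a small sketch. The abstract does not state which TF-IDF variant was used, so the formula below (relative term frequency times the natural log of the inverse document frequency) and the function name `tf_idf` are assumptions chosen for simplicity.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. The score is
    (count of term in doc / doc length) * ln(number of docs / docs containing term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus of stemmed abstracts (invented for illustration only).
corpus = [
    ["pain", "patient", "tmd"],
    ["pain", "patient", "qol"],
    ["pain", "studi"],
]
scores = tf_idf(corpus)
```

In this toy corpus, "pain" appears in every document and so receives a score of zero, while "tmd" appears in only one and outranks "patient". This mirrors the abstract's contrast between the most frequent terms and the terms with the highest TF-IDF.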


Author(s):  
K.G.C.M Kooragama ◽  
L.R.W.D. Jayashanka ◽  
J.A. Munasinghe ◽  
K.W. Jayawardana ◽  
Muditha Tissera ◽  
...  

2021 ◽  
Author(s):  
Dilith Sasanka ◽  
H. K. N Malshani ◽  
Uchitha I. Wickramaratne ◽  
Yashmitha Kavindi ◽  
Muditha Tissera ◽  
...  

10.2196/23230 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e23230
Author(s):  
Pei-Fu Chen ◽  
Ssu-Ming Wang ◽  
Wei-Chih Liao ◽  
Lu-Cheng Kuo ◽  
Kuan-Chih Chen ◽  
...  

Background The International Classification of Diseases (ICD) code is widely used as the reference in medical systems and for billing purposes. However, classifying diseases into ICD codes still mainly relies on humans reading a large amount of written material as the basis for coding. Coding is both laborious and time-consuming. Since the conversion from ICD-9 to ICD-10, the coding task has become much more complicated, and deep learning– and natural language processing–related approaches have been studied to assist disease coders. Objective This paper aims to construct a deep learning model for ICD-10 coding that automatically determines the corresponding diagnosis and procedure codes based solely on free-text medical notes, to improve accuracy and reduce human effort. Methods We used diagnosis records of the National Taiwan University Hospital as resources and applied natural language processing techniques, including global vectors, word to vectors, embeddings from language models, bidirectional encoder representations from transformers, and single head attention recurrent neural network, on the deep neural network architecture to implement ICD-10 auto-coding. In addition, we introduced the attention mechanism into the classification model to extract the keywords from diagnoses and visualize the coding reference for training newcomers to ICD-10 coding. Sixty discharge notes were randomly selected to examine the change in the F1-score and the coding time by coders before and after using our model. Results In experiments on the medical data set of National Taiwan University Hospital, our prediction results revealed F1-scores of 0.715 and 0.618 for the ICD-10 Clinical Modification code and Procedure Coding System code, respectively, with a bidirectional encoder representations from transformers embedding approach in the Gated Recurrent Unit classification model. The well-trained models were applied on the ICD-10 web service for coding and for training ICD-10 users. With this service, the coders' F1-score increased significantly from a median of 0.832 to 0.922 (P<.05), but coding time was not reduced. Conclusions The proposed model significantly improved the F1-score but did not decrease the time consumed in coding by disease coders.
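
For readers unfamiliar with the metric, the F1-score for a single discharge note can be sketched as the harmonic mean of precision and recall over the set of assigned codes. The abstract does not say how scores were averaged across notes or codes, so the per-note, set-based formulation and the function name below are assumptions. The example codes are arbitrary.

```python
def f1_score(predicted, actual):
    """F1 for one note: harmonic mean of precision and recall
    over the predicted and the reference code sets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# One correct code out of two predicted, against three reference codes:
# precision = 1/2, recall = 1/3, F1 = 0.4
example = f1_score(["I10", "J18.9"], ["I10", "E11.9", "N17.9"])
```

A median over notes, as reported in the abstract, would then be taken over such per-note values if this formulation matches the study's.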


2020 ◽  
Vol 6 ◽  
Author(s):  
David Owen ◽  
Laurence Livermore ◽  
Quentin Groom ◽  
Alex Hardisty ◽  
Thijs Leegwater ◽  
...  

We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies. Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images. Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text. Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. 
Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html). We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
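
The pipeline described above (segmentation, then OCR on text-bearing labels only, then NER on the recognised text) can be sketched as a simple composition of stages. Everything in this sketch is hypothetical: `LabelRecord`, `digitise`, and the three callables are illustrative placeholders, not the authors' code and not the API of Tesseract, Cloud Vision, or any NER library.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class LabelRecord:
    """Text recognised on one label plus the entities found in it."""
    text: str
    entities: List[Tuple[str, str]] = field(default_factory=list)


def digitise(
    image: Any,
    segment: Callable[[Any], Iterable[Any]],
    ocr: Callable[[Any], str],
    ner: Callable[[str], List[Tuple[str, str]]],
) -> List[LabelRecord]:
    """Run the three stages in order.

    segment: hypothetical stage returning only the text-bearing regions
    ocr:     hypothetical stage turning one region into text
    ner:     hypothetical stage returning (entity_type, surface_form) pairs
    """
    records = []
    for region in segment(image):
        text = ocr(region)
        records.append(LabelRecord(text=text, entities=ner(text)))
    return records
```

Passing the stages in as callables keeps the sketch independent of any particular tool, which fits the abstract's framing of the components as interchangeable recommendations.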

