Augment BERT with average pooling layer for Chinese summary generation

The BERT pre-trained language model has achieved good results in various subtasks of natural language processing, but its performance in generating Chinese summaries is not ideal. The most intuitive reason is that the BERT model is based on character-level composition, while the Chinese language is mostly in the form of phrases. Directly fine-tuning the BERT model cannot achieve the expected effect. This paper proposes a novel summary generation model with BERT augmented by the pooling layer. In our model, we perform an average pooling operation on token embedding to improve the model’s ability to capture phrase-level semantic information. We use LCSTS and NLPCC2017 to verify our proposed method. Experimental data shows that the average pooling model’s introduction can effectively improve the generated summary quality. Furthermore, different data needs to be set with varying pooling kernel sizes to achieve the best results through comparative analysis. In addition, our proposed method has strong generalizability. It can be applied not only to the task of generating summaries, but also to other natural language processing tasks.

Download Full-text

Multimodal Hate Speech Detection in Greek Social Media

Multimodal Technologies and Interaction ◽

10.3390/mti5070034 ◽

2021 ◽

Vol 5 (7) ◽

pp. 34

Author(s):

Konstantinos Perifanos ◽

Dionysis Goutsos

Keyword(s):

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Hate Speech ◽

Language Model ◽

Fine Tuning ◽

Accuracy Score ◽

Speech Detection ◽

Online Social Media

Hateful and abusive speech presents a major challenge for all online social media platforms. Recent advances in Natural Language Processing and Natural Language Understanding allow for more accurate detection of hate speech in textual streams. This study presents a new multimodal approach to hate speech detection by combining Computer Vision and Natural Language processing models for abusive context detection. Our study focuses on Twitter messages and, more specifically, on hateful, xenophobic, and racist speech in Greek aimed at refugees and migrants. In our approach, we combine transfer learning and fine-tuning of Bidirectional Encoder Representations from Transformers (BERT) and Residual Neural Networks (Resnet). Our contribution includes the development of a new dataset for hate speech classification, consisting of tweet IDs, along with the code to obtain their visual appearance, as they would have been rendered in a web browser. We have also released a pre-trained Language Model trained on Greek tweets, which has been used in our experiments. We report a consistently high level of accuracy (accuracy score = 0.970, f1-score = 0.947 in our best model) in racist and xenophobic speech detection.

Download Full-text

A WORD-BASED CHINESE LANGUAGE UNDERSTANDING SYSTEM

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001488000042 ◽

1988 ◽

Vol 02 (01) ◽

pp. 25-35

Author(s):

TIAN-SHUN YAO

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Chinese Language ◽

Computer Programs ◽

World Knowledge ◽

Knowledge Source ◽

Language Understanding ◽

Language Analysis ◽

The World

With the word-based theory of natural language processing, a word-based Chinese language understanding system has been developed. In the light of psychological language analysis and the features of the Chinese language, this theory of natural language processing is presented with the description of the computer programs based on it. The heart of the system is to define a Total Information Dictionary and the World Knowledge Source used in the system. The purpose of this research is to develop a system which can understand not only Chinese sentences but also the whole text.

Download Full-text

EventEpi–A Natural Language Processing Framework for Event-Based Surveillance

10.1101/19006395 ◽

2019 ◽

Author(s):

Auss Abbood ◽

Alexander Ullrich ◽

Rüdiger Busche ◽

Stéphane Ghozzi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Web Application ◽

Fine Tuning ◽

Entity Recognition ◽

World Health ◽

Support Vector ◽

Event Based ◽

Processing Framework

AbstractAccording to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of epidemiologists sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural-language-processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We trained a naive Bayes classifier to find the single most likely one using RKI’s EBS database as labels. Then, for relevance scoring, we defined two classes to which any article might belong: The article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using document and word embeddings. Two of the tested algorithms stood out: The multilayer perceptron performed best overall, with a precision of 0.19, recall of 0.50, specificity of 0.89, F1 of 0.28, and the highest tested index balanced accuracy of 0.46. The support-vector machine, on the other hand, had the highest recall (0.88) which can be of higher interest for epidemiologists. Finally, we integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, that will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, works already well and can be used in production, promising improvements in EBS. The source code is publicly available at https://github.com/aauss/EventEpi.

Download Full-text

EMOSIS Sentiment Analysis on Tweets with Emotion and Intensity Level Recognition Considering Ending Punctuation Marks

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d4518.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 10289-10293

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Emotion Recognition ◽

Sentiment Analysis ◽

Language Processing ◽

Significant Role ◽

Language Model ◽

Intensity Level ◽

Processing Stage ◽

Overall Performance

Sentiment Analysis is a tool used for determining the Polarity or Emotion of a Sentence. It is a field of Natural Language Processing which focuses on the study of opinions. In this study, the researchers solved one key challenge in Sentiment Analysis, which is to consider the Ending Punctuation Marks present in a sentence. Ending punctuation marks plays a significant role in Emotion Recognition and Intensity Level Recognition. The research made used of tweets expressing opinions about Philippine President Rodrigo Duterte. These downloaded tweets served as the inputs. It was initially subjected to pre-processing stage to be able to prepare the sentences for processing. A Language Model was created to serve as the classifier for determining the scores of the tweets. The scores give the polarity of the sentence. Accuracy is very important in sentiment analysis. To increase the chance of correctly identifying the polarity of the tweets, the input undergone Intensity Level Recognition which determines the intensifiers and negations within the sentences. The system was evaluated with overall performance of 80.27%.

Download Full-text

Syntactic and semantic information extraction from NPP procedures utilizing natural language processing integrated with rules

Nuclear Engineering and Technology ◽

10.1016/j.net.2020.08.010 ◽

2020 ◽

Author(s):

Yongsun Choi ◽

Minh Duc Nguyen ◽

Thomas N. Kerr

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Semantic Information

Download Full-text

Forty-two Million Ways to Describe Pain: Topic Modeling of 200,000 PubMed Pain-Related Abstracts Using Natural Language Processing and Deep Learning–Based Text Generation

Pain Medicine ◽

10.1093/pm/pnaa061 ◽

2020 ◽

Vol 21 (11) ◽

pp. 3133-3160

Author(s):

Patrick J Tighe ◽

Bharadwaj Sannapaneni ◽

Roger B Fillingim ◽

Charlie Doyle ◽

Michael Kent ◽

...

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Topic Modeling ◽

Taxonomic Structure ◽

Generation Model ◽

Pain Research ◽

Embedded Model ◽

Gated Recurrent Units

Abstract Objective Recent efforts to update the definitions and taxonomic structure of concepts related to pain have revealed opportunities to better quantify topics of existing pain research subject areas. Methods Here, we apply basic natural language processing (NLP) analyses on a corpus of >200,000 abstracts published on PubMed under the medical subject heading (MeSH) of “pain” to quantify the topics, content, and themes on pain-related research dating back to the 1940s. Results The most common stemmed terms included “pain” (601,122 occurrences), “patient” (508,064 occurrences), and “studi-” (208,839 occurrences). Contrarily, terms with the highest term frequency–inverse document frequency included “tmd” (6.21), “qol” (6.01), and “endometriosis” (5.94). Using the vector-embedded model of term definitions available via the “word2vec” technique, the most similar terms to “pain” included “discomfort,” “symptom,” and “pain-related.” For the term “acute,” the most similar terms in the word2vec vector space included “nonspecific,” “vaso-occlusive,” and “subacute”; for the term “chronic,” the most similar terms included “persistent,” “longstanding,” and “long-standing.” Topic modeling via Latent Dirichlet analysis identified peak coherence (0.49) at 40 topics. Network analysis of these topic models identified three topics that were outliers from the core cluster, two of which pertained to women’s health and obstetrics and were closely connected to one another, yet considered distant from the third outlier pertaining to age. A deep learning–based gated recurrent units abstract generation model successfully synthesized several unique abstracts with varying levels of believability, with special attention and some confusion at lower temperatures to the roles of placebo in randomized controlled trials. Conclusions Quantitative NLP models of published abstracts pertaining to pain may point to trends and gaps within pain research communities.

Download Full-text

Natural Language Processing Techniques for the Extraction of Semantic Information in Web Services

2008 Seventh Mexican International Conference on Artificial Intelligence ◽

10.1109/micai.2008.50 ◽

2008 ◽

Cited By ~ 2

Author(s):

Maricela Bravo ◽

Azucena Montes ◽

Alejandro Reyes

Keyword(s):

Natural Language Processing ◽

Web Services ◽

Natural Language ◽

Language Processing ◽

Semantic Information ◽

Processing Techniques

Download Full-text

Logical Intelligent Detection Algorithm of Chinese Language Articles Based on Text Mining

Mobile Information Systems ◽

10.1155/2021/8115551 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Zihui Zheng

Keyword(s):

Natural Language Processing ◽

Text Mining ◽

Natural Language ◽

Language Processing ◽

Chinese Language ◽

Detection Algorithm ◽

Processing Technology ◽

Experimental Comparison ◽

Intelligent Detection ◽

Web Text Mining

With the advent of the big data era and the rapid development of the Internet industry, the information processing technology of text mining has become an indispensable role in natural language processing. In our daily life, many things cannot be separated from natural language processing technology, such as machine translation, intelligent response, and semantic search. At the same time, with the development of artificial intelligence, text mining technology has gradually developed into a research hotspot. There are many ways to realize text mining. This paper mainly describes the realization of web text mining and the realization of text structure algorithm based on HTML through a variety of methods to compare the specific clustering time of web text mining. Through this comparison, we can also get which web mining is the most efficient. The use of WebKB datasets for many times in experimental comparison also reflects that Web text mining for the Chinese language logic intelligent detection algorithm provides a basis.

Download Full-text

DeNERT-KG: Named Entity and Relation Extraction Model Using DQN, Knowledge Graph, and BERT

Applied Sciences ◽

10.3390/app10186429 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6429

Author(s):

SungMin Yang ◽

SoYeop Yoo ◽

OkRan Jeong

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Model ◽

Named Entity Recognition ◽

Relation Extraction ◽

Entity Recognition ◽

Knowledge Graph ◽

Named Entity ◽

Artificial Intelligence Technology

Along with studies on artificial intelligence technology, research is also being carried out actively in the field of natural language processing to understand and process people’s language, in other words, natural language. For computers to learn on their own, the skill of understanding natural language is very important. There are a wide variety of tasks involved in the field of natural language processing, but we would like to focus on the named entity registration and relation extraction task, which is considered to be the most important in understanding sentences. We propose DeNERT-KG, a model that can extract subject, object, and relationships, to grasp the meaning inherent in a sentence. Based on the BERT language model and Deep Q-Network, the named entity recognition (NER) model for extracting subject and object is established, and a knowledge graph is applied for relation extraction. Using the DeNERT-KG model, it is possible to extract the subject, type of subject, object, type of object, and relationship from a sentence, and verify this model through experiments.

Download Full-text

Arabic Poem Generation Incorporating Deep Learning and Phonetic CNNsubword Embedding Models

International Journal of Robotic Computing ◽

10.35708/tai1868-126246 ◽

2019 ◽

pp. 64-91

Author(s):

Sameerah Talafha ◽

Banafsheh Rekabdar

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Model Performance ◽

Arabic Language ◽

Generation Model ◽

Two Stage ◽

Human Evaluation ◽

Effective Contribution

Arabic poetry generation is a very challenging task since the linguistic structure of the Arabic language is considered a severe challenge for many researchers and developers in the Natural Language Processing (NLP) field. In this paper, we propose a poetry generation model with extended phonetic and semantic embeddings (Phonetic CNNsubword embeddings). We show that Phonetic CNNsubword embeddings have an effective contribution to the overall model performance compared to FastTextsubword embeddings. Our poetry generation model consists of a two-stage approach: (1.) generating the first verse which explicitly incorporates the theme related phrase, (2.) other verses generation with the proposed Hierarchy-Attention Sequence-to-Sequence model (HAS2S), which adequately capture word, phrase, and verse information between contexts. A comprehensive human evaluation confirms that the poems generated by our model outperform the base models in criteria such as Meaning, Coherence, Fluency, and Poeticness. Extensive quantitative experiments using Bi-Lingual Evaluation Understudy (BLEU) scores also demonstrate significant improvements over strong baselines.

Download Full-text