Measurement of Text Similarity: A Survey

Text similarity measurement is the basis of natural language processing tasks, which play an important role in information retrieval, automatic question answering, machine translation, dialogue systems, and document matching. This paper systematically combs the research status of similarity measurement, analyzes the advantages and disadvantages of current methods, develops a more comprehensive classification description system of text similarity measurement algorithms, and summarizes the future development direction. With the aim of providing reference for related research and application, the text similarity measurement method is described by two aspects: text distance and text representation. The text distance can be divided into length distance, distribution distance, and semantic distance; text representation is divided into string-based, corpus-based, single-semantic text, multi-semantic text, and graph-structure-based representation. Finally, the development of text similarity is also summarized in the discussion section.

Download Full-text

ALBERT-QM: An ALBERT Based Method for Chinese Health Related Question Matching (Preprint)

10.2196/preprints.17700 ◽

2020 ◽

Author(s):

Feihong Yang ◽

Jiao Li

Keyword(s):

Natural Language ◽

Language Processing ◽

Question Answering ◽

Binary Classification ◽

Medical Knowledge ◽

Reports of the Workshops Held at the 2019 AAAI Conference on Artificial Intelligence

AI Magazine ◽

10.1609/aimag.v40i3.4981 ◽

2019 ◽

Vol 40 (3) ◽

pp. 67-78

Author(s):

Guy Barash ◽

Mauricio Castillo-Effen ◽

Niyati Chhaya ◽

Peter Clark ◽

Huáscar Espinoza ◽

...

Keyword(s):

Artificial Intelligence ◽

Language Processing ◽

Cyber Security ◽

Question Answering ◽

Intent Recognition ◽

Affective Content ◽

Learning Plan ◽

Dialog System ◽

Affective Content Analysis ◽

Games And Simulations

The workshop program of the Association for the Advancement of Artificial Intelligence’s 33rd Conference on Artificial Intelligence (AAAI-19) was held in Honolulu, Hawaii, on Sunday and Monday, January 27–28, 2019. There were fifteen workshops in the program: Affective Content Analysis: Modeling Affect-in-Action, Agile Robotics for Industrial Automation Competition, Artificial Intelligence for Cyber Security, Artificial Intelligence Safety, Dialog System Technology Challenge, Engineering Dependable and Secure Machine Learning Systems, Games and Simulations for Artificial Intelligence, Health Intelligence, Knowledge Extraction from Games, Network Interpretability for Deep Learning, Plan, Activity, and Intent Recognition, Reasoning and Learning for Human-Machine Dialogues, Reasoning for Complex Question Answering, Recommender Systems Meet Natural Language Processing, Reinforcement Learning in Games, and Reproducible AI. This report contains brief summaries of the all the workshops that were held.

Download Full-text

Comparative Question Answering System based on Natural Language Processing and Machine Learning

2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS) ◽

10.1109/icais50930.2021.9396015 ◽

2021 ◽

Author(s):

Rohit Arora ◽

Parth Singh ◽

Hemlata Goyal ◽

Sunita Singhal ◽

Smita Vijayvargiya

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Question Answering ◽

Question Answering System

Download Full-text

Simulation of bright–dark soliton solutions of the Lonngren wave equation arising the model of transmission lines

Modern Physics Letters B ◽

10.1142/s0217984921504844 ◽

2021 ◽

pp. 2150484

Author(s):

Asif Yokuş

Keyword(s):

Wave Equation ◽

Soliton Solutions ◽

Stationary Wave ◽

Dark Soliton ◽

Traveling Wave Solution ◽

Bright Soliton ◽

Auxiliary Equation ◽

Discussion Section ◽

Auxiliary Equation Method ◽

Advantages And Disadvantages

In this study, the auxiliary equation method is applied successfully to the Lonngren wave equation. Bright soliton, bright–dark soliton solutions are produced, which play an important role in the distribution and distribution of electric charge. In the conclusion and discussion section, the effect of nonlinearity term on wave behavior in bright soliton traveling wave solution is examined. The advantages and disadvantages of the method are discussed. While graphs representing the stationary wave are obtained, special values are given to the constants in the solutions. These graphs are presented as 3D, 2D and contour.

Download Full-text

A content spectral-based text representation

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219248 ◽

2021 ◽

pp. 1-12

Author(s):

Melesio Crespo-Sanchez ◽

Ivan Lopez-Arevalo ◽

Edwin Aldana-Bobadilla ◽

Alejandro Molina-Villegas

Keyword(s):

Machine Learning ◽

Text Analysis ◽

Question Answering ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Text Representation ◽

Feature Vectors ◽

Learning Tasks ◽

Semantic Component ◽

Vector Representations

In the last few years, text analysis has grown as a keystone in several domains for solving many real-world problems, such as machine translation, spam detection, and question answering, to mention a few. Many of these tasks can be approached by means of machine learning algorithms. Most of these algorithms take as input a transformation of the text in the form of feature vectors containing an abstraction of the content. Most of recent vector representations focus on the semantic component of text, however, we consider that also taking into account the lexical and syntactic components the abstraction of content could be beneficial for learning tasks. In this work, we propose a content spectral-based text representation applicable to machine learning algorithms for text analysis. This representation integrates the spectra from the lexical, syntactic, and semantic components of text producing an abstract image, which can also be treated by both, text and image learning algorithms. These components came from feature vectors of text. For demonstrating the goodness of our proposal, this was tested on text classification and complexity reading score prediction tasks obtaining promising results.

Download Full-text

A survey of question answering in natural language processing

Poetics ◽

10.1016/0304-422x(90)90032-z ◽

1990 ◽

Vol 19 (1-2) ◽

pp. 99-120

Author(s):

Stefan Wermter ◽

Wendy G. Lehnert

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Question Answering

Download Full-text

Enhancing Natural Language Inference Using New and Expanded Training Data Sets and New Learning Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6371 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8504-8511

Author(s):

Arindam Mitra ◽

Ishan Shrivastava ◽

Chitta Baral

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Question Answering ◽

Training Data ◽

Data Sets ◽

Learning Models ◽

New Learning ◽

Word Attention ◽

Attention Function

Natural Language Inference (NLI) plays an important role in many natural language processing tasks such as question answering. However, existing NLI modules that are trained on existing NLI datasets have several drawbacks. For example, they do not capture the notion of entity and role well and often end up making mistakes such as “Peter signed a deal” can be inferred from “John signed a deal”. As part of this work, we have developed two datasets that help mitigate such issues and make the systems better at understanding the notion of “entities” and “roles”. After training the existing models on the new dataset we observe that the existing models do not perform well on one of the new benchmark. We then propose a modification to the “word-to-word” attention function which has been uniformly reused across several popular NLI architectures. The resulting models perform as well as their unmodified counterparts on the existing benchmarks and perform significantly well on the new benchmarks that emphasize “roles” and “entities”.

Download Full-text

Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

Wireless Communications and Mobile Computing ◽

10.1155/2021/5375334 ◽

2021 ◽

Vol 2021 ◽

pp. 1-17

Author(s):

Changchang Zeng ◽

Shaobo Li

Keyword(s):

Reading Comprehension ◽

Language Processing ◽

Question Answering ◽

Multiple Choice ◽

Length Distribution ◽

Research Field ◽

Evaluation Framework ◽

Language Models ◽

Training Objective ◽

Machine Reading

Machine reading comprehension (MRC) is a challenging natural language processing (NLP) task. It has a wide application potential in the fields of question answering robots, human-computer interactions in mobile virtual reality systems, etc. Recently, the emergence of pretrained models (PTMs) has brought this research field into a new era, in which the training objective plays a key role. The masked language model (MLM) is a self-supervised training objective widely used in various PTMs. With the development of training objectives, many variants of MLM have been proposed, such as whole word masking, entity masking, phrase masking, and span masking. In different MLMs, the length of the masked tokens is different. Similarly, in different machine reading comprehension tasks, the length of the answer is also different, and the answer is often a word, phrase, or sentence. Thus, in MRC tasks with different answer lengths, whether the length of MLM is related to performance is a question worth studying. If this hypothesis is true, it can guide us on how to pretrain the MLM with a relatively suitable mask length distribution for MRC tasks. In this paper, we try to uncover how much of MLM’s success in the machine reading comprehension tasks comes from the correlation between masking length distribution and answer length in the MRC dataset. In order to address this issue, herein, (1) we propose four MRC tasks with different answer length distributions, namely, the short span extraction task, long span extraction task, short multiple-choice cloze task, and long multiple-choice cloze task; (2) four Chinese MRC datasets are created for these tasks; (3) we also have pretrained four masked language models according to the answer length distributions of these datasets; and (4) ablation experiments are conducted on the datasets to verify our hypothesis. The experimental results demonstrate that our hypothesis is true. On four different machine reading comprehension datasets, the performance of the model with correlation length distribution surpasses the model without correlation.

Download Full-text

ENTITY EXTRACTION FROM UNSTRUCTURED MEDICAL TEXT

International Journal of Engineering Applied Sciences and Technology ◽

10.33564/ijeast.2020.v05i08.041 ◽

2020 ◽

Vol 5 (8) ◽

Author(s):

G Deepank ◽

R Tharun Raj ◽

Aditya Verma

Keyword(s):

Language Processing ◽

Medical Records ◽

Question Answering ◽

Entity Extraction ◽

Data Repositories ◽

Efficient Treatment ◽

Learning Techniques ◽

Treatment Plans ◽

Data Elements ◽

Rich Data

Electronic medical records represent rich data repositories loaded with valuable patient information. As artificial intelligence and machine learning in the field of medicine is becoming more popular by the day, ways to integrate it are always changing. One such way is processing the clinical notes and records, which are maintained by doctors and other medical professionals. Natural language processing can record this data and read more deeply into it than any human. Deep learning techniques such as entity extraction which involves identifying and returning of key data elements from an electronic medical record, and other techniques involving models such as BERT for question answering, when applied to all these medical records can create bespoke and efficient treatment plans for the patients, which can help in a swift and carefree recovery.

Download Full-text

Emerging trends: A gentle introduction to fine-tuning

Natural Language Engineering ◽

10.1017/s1351324921000322 ◽

2021 ◽

Vol 27 (6) ◽

pp. 763-778

Author(s):

Kenneth Ward Church ◽

Zeyu Chen ◽

Yanjun Ma

Keyword(s):

Natural Language ◽

Language Processing ◽

Question Answering ◽

General Purpose ◽

Fine Tuning ◽

Language Engineering ◽

Training Models ◽

Emerging Trends ◽

Foundation Model ◽

Programming Skills

AbstractThe previous Emerging Trends article (Church et al., 2021. Natural Language Engineering27(5), 631–645.) introduced deep nets to poets. Poets is an imperfect metaphor, intended as a gesture toward inclusion. The future for deep nets will benefit by reaching out to a broad audience of potential users, including people with little or no programming skills, and little interest in training models. That paper focused on inference, the use of pre-trained models, as is, without fine-tuning. The goal of this paper is to make fine-tuning more accessible to a broader audience. Since fine-tuning is more challenging than inference, the examples in this paper will require modest programming skills, as well as access to a GPU. Fine-tuning starts with a general purpose base (foundation) model and uses a small training set of labeled data to produce a model for a specific downstream application. There are many examples of fine-tuning in natural language processing (question answering (SQuAD) and GLUE benchmark), as well as vision and speech.

Download Full-text