Automatic Construction of Fine-Grained Paraphrase Corpora System Using Language Inference Model

Ying Zhou; Xiaokang Hu; Vera Chung

doi:10.3390/app12010499

Automatic Construction of Fine-Grained Paraphrase Corpora System Using Language Inference Model

Applied Sciences ◽

10.3390/app12010499 ◽

2022 ◽

Vol 12 (1) ◽

pp. 499

Author(s):

Ying Zhou ◽

Xiaokang Hu ◽

Vera Chung

Keyword(s):

Natural Language Processing ◽

Language Processing ◽

Classification Models ◽

Inference Model ◽

Inference Models ◽

Semantic Level ◽

Automatic Construction ◽

Fine Grained ◽

Sentence Level ◽

Paraphrase Generation

Paraphrase detection and generation are important natural language processing (NLP) tasks. Yet the term paraphrase is broad enough to include many fine-grained relations. This leads to different tolerance levels of semantic divergence in the positive paraphrase class among publicly available paraphrase datasets. Such variation can affect the generalisability of paraphrase classification models. It may also impact the predictability of paraphrase generation models. This paper presents a new model which can use few corpora of fine-grained paraphrase relations to construct automatically using language inference models. The fine-grained sentence level paraphrase relations are defined based on word and phrase level counterparts. We demonstrate that the fine-grained labels from our proposed system can make it possible to generate paraphrases at desirable semantic level. The new labels could also contribute to general sentence embedding techniques.

Download Full-text

Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods

Computational Linguistics ◽

10.1162/coli_a_00002 ◽

2010 ◽

Vol 36 (3) ◽

pp. 341-387 ◽

Cited By ~ 65

Author(s):

Nitin Madnani ◽

Bonnie J. Dorr

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Recent Work ◽

Language Processing ◽

Data Driven ◽

Future Trends ◽

Automatic Construction ◽

Work Done ◽

Paraphrase Generation ◽

Potential Use

The task of paraphrasing is inherently familiar to speakers of all languages. Moreover, the task of automatically generating or extracting semantic equivalences for the various units of language—words, phrases, and sentences—is an important part of natural language processing (NLP) and is being increasingly employed to improve the performance of several NLP applications. In this article, we attempt to conduct a comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods, while also conveying an appreciation for the importance and potential use of paraphrases in the field of NLP research. Recent work done in manual and automatic construction of paraphrase corpora is also examined. We also discuss the strategies used for evaluating paraphrase generation techniques and briefly explore some future trends in paraphrase generation.

Download Full-text

Senti-BAS: A BERT-based model with sentiment computing for happiness research (Preprint)

10.2196/preprints.27914 ◽

2021 ◽

Author(s):

Zeyuan Zeng ◽

Yijia Zhang ◽

Liang Yang ◽

Hongfei Lin

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Language Processing ◽

High Accuracy ◽

Language Models ◽

Fine Grained ◽

Label Information ◽

Common Criterion ◽

Text Content ◽

Sentiment Computing

BACKGROUND Happiness becomes a rising topic that we all care about recently. It can be described in various forms. For the text content, it is an interesting subject that we can do research on happiness by utilizing natural language processing (NLP) methods. OBJECTIVE As an abstract and complicated emotion, there is no common criterion to measure and describe happiness. Therefore, researchers are creating different models to study and measure happiness. METHODS In this paper, we present a deep-learning based model called Senti-BAS (BERT embedded Bi-LSTM with self-Attention mechanism along with the Sentiment computing). RESULTS Given a sentence that describes how a person felt happiness recently, the model can classify the happiness scenario in the sentence with two topics: was it controlled by the author (label ‘agency’), and was it involving other people (label ‘social’). Besides language models, we employ the label information through sentiment computing based on lexicon. CONCLUSIONS The model performs with a high accuracy on both ‘agency’ and ‘social’ labels, and we also make comparisons with several popular embedding models like Elmo, GPT. Depending on our work, we can study the happiness at a more fine-grained level.

Download Full-text

Semi-Supervised Natural Language Processing Approach for Fine-Grained Classification of Medical Reports

10.1109/urtc49097.2019.9660430 ◽

2019 ◽

Author(s):

Neil Deshmukh

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Fine Grained ◽

Processing Approach ◽

Medical Reports

Download Full-text

Sentence embeddings in NLI with iterative refinement encoders

Natural Language Engineering ◽

10.1017/s1351324919000202 ◽

2019 ◽

Vol 25 (4) ◽

pp. 467-482 ◽

Cited By ~ 3

Author(s):

Aarne Talman ◽

Anssi Yli-Jyrä ◽

Jörg Tiedemann

Keyword(s):

Neural Networks ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Recurrent Neural Networks ◽

State Of The Art ◽

Iterative Refinement ◽

Learning Tasks ◽

Sentence Level ◽

Refinement Strategy

AbstractSentence-level representations are necessary for various natural language processing tasks. Recurrent neural networks have proven to be very effective in learning distributed representations and can be trained efficiently on natural language inference tasks. We build on top of one such model and propose a hierarchy of bidirectional LSTM and max pooling layers that implements an iterative refinement strategy and yields state of the art results on the SciTail dataset as well as strong results for Stanford Natural Language Inference and Multi-Genre Natural Language Inference. We can show that the sentence embeddings learned in this way can be utilized in a wide variety of transfer learning tasks, outperforming InferSent on 7 out of 10 and SkipThought on 8 out of 9 SentEval sentence embedding evaluation tasks. Furthermore, our model beats the InferSent model in 8 out of 10 recently published SentEval probing tasks designed to evaluate sentence embeddings’ ability to capture some of the important linguistic properties of sentences.

Download Full-text

Can MOOC Instructor Be Portrayed by Semantic Features? Using Discourse and Clustering Analysis to Identify Lecture-Style of Instructors in MOOCs

Frontiers in Psychology ◽

10.3389/fpsyg.2021.751492 ◽

2021 ◽

Vol 12 ◽

Author(s):

Changcheng Wu ◽

Junyi Li ◽

Ye Zhang ◽

Chunmei Lan ◽

Kaiji Zhou ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Semantic Analysis ◽

Sentence Length ◽

Semantic Features ◽

Course Quality ◽

Fine Grained ◽

Massive Open Online Course ◽

Complex Words

Nowadays, most courses in massive open online course (MOOC) platforms are xMOOCs, which are based on the traditional instruction-driven principle. Course lecture is still the key component of the course. Thus, analyzing lectures of the instructors of xMOOCs would be helpful to evaluate the course quality and provide feedback to instructors and researchers. The current study aimed to portray the lecture styles of instructors in MOOCs from the perspective of natural language processing. Specifically, 129 course transcripts were downloaded from two major MOOC platforms. Two semantic analysis tools (linguistic inquiry and word count and Coh-Metrix) were used to extract semantic features including self-reference, tone, effect, cognitive words, cohesion, complex words, and sentence length. On the basis of the comments of students, course video review, and the results of cluster analysis, we found four different lecture styles: “perfect,” “communicative,” “balanced,” and “serious.” Significant differences were found between the different lecture styles within different disciplines for notes taking, discussion posts, and overall course satisfaction. Future studies could use fine-grained log data to verify the results of our study and explore how to use the results of natural language processing to improve the lecture of instructors in both MOOCs and traditional classes.

Download Full-text

A Security-Related Reputation Scheme of Android Apps Based on NLP Analysis of Comments

10.20944/preprints202003.0332.v1 ◽

2020 ◽

Author(s):

Franklin Tchakounté ◽

Athanase Esdras Yera Pagore ◽

Marcellin Atemkeng ◽

Jean Claude Kamgang

Keyword(s):

Natural Language Processing ◽

Sentiment Analysis ◽

Language Processing ◽

Text Analysis ◽

Positive Functional ◽

Fine Grained ◽

Android Apps ◽

Depth Analysis ◽

Android Applications ◽

Google Play

Comments are exploited by product vendors to measure satisfaction of consumers. With the advent of Natural Language Processing (NLP), comments on Google Play can be processed to extract knowledge on applications such as their reputation. Proposals in that direction are either informal or interested merely on functionality. Unlike, this work aims to determine reputation of Android applications in terms of confidentiality, integrity, availability and authentication (CIAA). This work proposes a model of assessing app reputation relying on sentiment analysis and text analysis of comments. While assuming that comments are reliable, we collect Google Play applications subject to comments which include security keywords. An in-depth analysis of keywords based on Naive Bayes classification is made to provide polarity of any comment. Based on comment polarity, reputation is evaluated for the whole application. Experiments made on real applications including dozens to billions of comments, reveal that developers lack to make efforts to guarantee CIAA services. A fine-grained analysis shows that not security reputed applications can be reputed in specific CIAA services. Results also show that applications with negative security polarities display in general positive functional polarities. This result suggests that security checking should include careful comment analysis to improve security of applications.

Download Full-text

Ανάλυση συναισθήματος και γνώμης

10.12681/eadd/44752 ◽

2018 ◽

Author(s):

Αγγελική-Σπυριδούλα Βλαχοστέργιου

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Language Processing ◽

Context Aware ◽

Sentence Level ◽

Document Level

Τα τελευταία χρόνια έχει παρατηρηθεί μια αύξηση του αριθμού των προσπαθειών για την αυτόματη αναγνώριση και κατηγοριοποίηση του ανθρωπίνου συναισθήματος χρησιμοποιώντας σήματα φυσιολογίας, σήματα από το πρόσωπο, τη φωνή, καθώς επίσης και προσωπικές ερμηνείες από κείμενα μεγάλων κοινωνικών δεδομένων. Αρκετοί είναι οι τομείς της έρευνας που θα μπορούσαν να επωφεληθούν από αυτά τα συστήματα: διαδραστικά συστήματα διδασκαλίας, τα οποία να επιτρέπουν στους εκπαιδευτικούς να γνωρίζουν το άγχος των φοιτητών, πρόληψη των ατυχημάτων (π.χ. εντοπισμός της κόπωσης του οδηγού), στρατιωτικά ομαδικά καθήκοντα που χαρακτηρίζονται από μεγάλης διάρκειας περιόδους άγχους και πίεσης και εφαρμογές στον τομέα της Υγείας για την έγκαιρη διάγνωση νευροεκφυλιστικών νόσων (π.χ. νόσος του Πάρκινσον), όπου η εκδήλωση των συμπτωμάτων συμβαίνει πολλά χρόνια μετά την έναρξη του νευροεκφυλισμού.Ωστόσο, παρά τις μέχρι τώρα ερευνητικές προσπάθειες, δεν έχει επιτευχθεί ο μακροπρόθεσμος στόχος της δημιουργίας ενός ισχυρού πλαισίου αναγνώρισης του εξεταζόμενου τομέα έρευνας που να βασίζεται στην ανάλυση και στην ερμηνεία του. Δεν υπάρχει καμία αμφιβολία ότι η δημιουργία του συναισθήματος (affect production) επηρεάζεται από το εκάστοτε πλαίσιο που λαμβάνει χώρα τη δεδομένη στιγμή, όπως το έργο στο οποίο υποβάλλεται ο χρήστης, τα άτομα που αλληλεπιδρούν με το χρήστη, η ταυτότητα αλλά και η εκφραστικότητά τους. Η οποιαδήποτε λοιπόν συμπληρωματική μορφή πληροφορίας πλαισίου αναφορικά με τον εξεταζόμενο τομέα έρευνας μας βοηθά ώστε να απαντήσουμε στο ερώτημα: τί είναι πιθανότερο να συμβεί, εκτρέποντας έτσι τον ταξινομητή από τις πιθανότερες/σχετικές κατηγορίες. Χωρίς το πλαίσιο, ακόμη και οι άνθρωποι μπορεί να παρερμηνεύουν τις παρατηρούμενες εκφράσεις του. Έτσι, με την αντιμετώπιση των προκλήσεων υπό το πρίσμα της αναγνώρισης του συναισθήματος υπό συγκεκριμένο πλαίσιο (context-aware affect analysis), δηλαδή με την καλύτερη μελέτη των πληροφοριών πλαισίου, με την ερμηνεία του σε συγκεκριμένους τομείς εφαρμογών, την αναπαράστασή του, τη μοντελοποίησή του, μπορούμε να προσεγγίσουμε καλύτερα την αναγνώριση του συναισθήματος σε πραγματικό χρόνο. Αντίστοιχα, στον τομέα των προσωπικών ερμηνειών από το κείμενο (Sentiment Analysis) αλλά και γενικότερα στον τομέα της Φυσικής Γλώσσας (Natural Language Processing (NLP)) η συνεισφορά του πλαισίου έγκειται στην καλύτερη αναγνώριση, ερμηνεία και επεξεργασία των απόψεων (opinions) και συναισθημάτων (sentiments) σε κείμενα, τα οποία εξετάζονται σε επίπεδο κειμένου (document-level), προτάσεων sentence-level και χαρακτηριστικών (aspect-level) αντίστοιχα. Στην περίπτωση αυτή, λαμβάνονται υπόψιν η σημασιολογία, οι γνωστικές και οι συναισθηματικές πληροφορίες των υποκειμενικών απαντήσεων των ατόμων. Ειδικότερα, στον τομέα αυτό, η συνεισφορά μας έγκειται στην εκπαίδευση ισχυρών αναπαραστάσεις χαρακτηριστικών από μη επισημειωμένα δεδομένα με τη χρήση Νευρωνικών Δικτύων και συγκεκριμένα με τη χρήση Ανταγωνιστικά Παραγωγικών Μοντέλων (GANs), η χρήση των οποίων έχει επιδείξει εντυπωσιακά αποτελέσματα στον τομέα της Όρασης Υπολογιστών. Η πρωτοτυπία της συγκεριμένης μεθόδου έγκειται στον τρόπο υλοποίησης του μοντέλου, στην επιλογή των υπερπαραμετρων, στη χρήση μη επιβλεπόμενης μάθησης και στην πειραματική επικύρωση του προτεινόμενου μοντέλου σε σώματα κειμένου που προέρχονται από διαφορετικές πηγές αναφορικά με το είδος τους και την έκτασή τους.

Download Full-text

Russian-Language Thesauri: Automatic Construction and Application for Natural Language Processing Tasks

Automatic Control and Computer Sciences ◽

10.3103/s0146411619070149 ◽

2019 ◽

Vol 53 (7) ◽

pp. 705-718

Author(s):

N. S. Lagutina ◽

K. V. Lagutina ◽

A. S. Adrianov ◽

I. V. Paramonov

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Russian Language ◽

Automatic Construction

Download Full-text

Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders

npj Schizophrenia ◽

10.1038/s41537-021-00154-3 ◽

2021 ◽

Vol 7 (1) ◽

Author(s):

Sunny X. Tang ◽

Reno Kriz ◽

Sunghye Cho ◽

Suh Jung Park ◽

Jenna Harowitz ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Small Sample ◽

First Person ◽

Schizophrenia Spectrum Disorders ◽

Speech Disturbance ◽

Schizophrenia Spectrum ◽

Sentence Level ◽

Spectrum Disorders

AbstractComputerized natural language processing (NLP) allows for objective and sensitive detection of speech disturbance, a hallmark of schizophrenia spectrum disorders (SSD). We explored several methods for characterizing speech changes in SSD (n = 20) compared to healthy control (HC) participants (n = 11) and approached linguistic phenotyping on three levels: individual words, parts-of-speech (POS), and sentence-level coherence. NLP features were compared with a clinical gold standard, the Scale for the Assessment of Thought, Language and Communication (TLC). We utilized Bidirectional Encoder Representations from Transformers (BERT), a state-of-the-art embedding algorithm incorporating bidirectional context. Through the POS approach, we found that SSD used more pronouns but fewer adverbs, adjectives, and determiners (e.g., “the,” “a,”). Analysis of individual word usage was notable for more frequent use of first-person singular pronouns among individuals with SSD and first-person plural pronouns among HC. There was a striking increase in incomplete words among SSD. Sentence-level analysis using BERT reflected increased tangentiality among SSD with greater sentence embedding distances. The SSD sample had low speech disturbance on average and there was no difference in group means for TLC scores. However, NLP measures of language disturbance appear to be sensitive to these subclinical differences and showed greater ability to discriminate between HC and SSD than a model based on clinical ratings alone. These intriguing exploratory results from a small sample prompt further inquiry into NLP methods for characterizing language disturbance in SSD and suggest that NLP measures may yield clinically relevant and informative biomarkers.

Download Full-text

Identification of Adverse Drug Event–Related Japanese Articles: Natural Language Processing Analysis (Preprint)

10.2196/preprints.22661 ◽

2020 ◽

Author(s):

Shogo Ujiie ◽

Shuntaro Yada ◽

Shoko Wakamiya ◽

Eiji Aramaki

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Drug Safety ◽

Automated System ◽

Pharmaceutical Companies ◽

Manual Labor ◽

Sentence Level ◽

Medical Articles ◽

Document Level

BACKGROUND Medical articles covering adverse drug events (ADEs) are systematically reported by pharmaceutical companies for drug safety information purposes. Although policies governing reporting to regulatory bodies vary among countries and regions, all medical article reporting may be categorized as precision or recall based. Recall-based reporting, which is implemented in Japan, requires the reporting of any possible ADE. Therefore, recall-based reporting can introduce numerous false negatives or substantial amounts of noise, a problem that is difficult to address using limited manual labor. OBJECTIVE Our aim was to develop an automated system that could identify ADE-related medical articles, support recall-based reporting, and alleviate manual labor in Japanese pharmaceutical companies. METHODS Using medical articles as input, our system based on natural language processing applies document-level classification to extract articles containing ADEs (replacing manual labor in the first screening) and sentence-level classification to extract sentences within those articles that imply ADEs (thus supporting experts in the second screening). We used 509 Japanese medical articles annotated by a medical engineer to evaluate the performance of the proposed system. RESULTS Document-level classification yielded an F1 of 0.903. Sentence-level classification yielded an F1 of 0.413. These were averages of fivefold cross-validations. CONCLUSIONS A simple automated system may alleviate the manual labor involved in screening drug safety–related medical articles in pharmaceutical companies. After improving the accuracy of the sentence-level classification by considering a wider context, we intend to apply this system toward real-world postmarketing surveillance.

Download Full-text