Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

Data ◽  
2018 ◽  
Vol 3 (4) ◽  
pp. 66 ◽  
Author(s):  
Svitlana Petrasova ◽  
Nina Khairova ◽  
Włodzimierz Lewoniewski ◽  
Orken Mamyrbayev ◽  
Kuralay Mukhsina

Similar text fragments extraction from weakly formalized data is a task of natural language processing and intelligent data analysis, used to solve the problem of automatically identifying connected knowledge fields. To search for such common communities in Wikipedia, we propose using a logical-algebraic model for similar collocations extraction as an additional stage. With the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words, and with WordNet synsets we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of high-frequency synonymous collocations can serve as an indicator of key, up-to-date common Wikipedia communities.
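As an illustration of the synonym-selection step, here is a minimal Python sketch in which NLTK stands in for the Stanford tools used in the paper: the words of a collocation are POS-tagged and their WordNet synsets are queried for synonyms. The function name and example collocation are illustrative, not from the study.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

def collocation_synonyms(collocation):
    """POS-tag each word of a collocation and collect its WordNet synonyms."""
    tagged = nltk.pos_tag(nltk.word_tokenize(collocation))
    synonyms = {}
    for word, tag in tagged:
        # Map Penn Treebank tags to WordNet POS categories.
        wn_pos = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}.get(tag[0])
        if wn_pos is None:
            continue
        lemmas = {l.name() for s in wn.synsets(word, pos=wn_pos) for l in s.lemmas()}
        synonyms[word] = sorted(lemmas - {word})
    return synonyms

print(collocation_synonyms("data analysis"))
```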

2021 ◽  
Author(s):  
Vishal Dey ◽  
Peter Krasniak ◽  
Minh Nguyen ◽  
Clara Lee ◽  
Xia Ning

BACKGROUND: A new illness can come to public attention through social media before it is medically defined, formally documented, or systematically studied. One example is a condition known as breast implant illness (BII), which has been extensively discussed on social media although it is vaguely defined in the medical literature.

OBJECTIVE: The objective of this study is to construct a data analysis pipeline to understand emerging illnesses using social media data, and to apply the pipeline to understand the key attributes of BII.

METHODS: We constructed a social media data analysis pipeline using natural language processing and topic modeling. Mentions related to signs, symptoms, diseases, disorders, and medical procedures were extracted from social media data using the clinical Text Analysis and Knowledge Extraction System (cTAKES). We mapped the mentions to standard medical concepts and then summarized these mapped concepts as topics using latent Dirichlet allocation (LDA). Finally, we applied this pipeline to understand BII from several BII-dedicated social media sites.

RESULTS: Our pipeline identified topics related to toxicity, cancer, and mental health issues that were highly associated with BII. It also showed that cancers, autoimmune disorders, and mental health problems were emerging concerns associated with breast implants, based on social media discussions. Furthermore, the pipeline identified mentions such as rupture, infection, pain, and fatigue as common self-reported issues among the public, as well as concerns about toxicity from silicone implants.

CONCLUSIONS: Our study could inspire future studies on the suggested symptoms and factors of BII. It provides the first analysis and derived knowledge of BII from social media using natural language processing techniques, and demonstrates the potential of using social media information to better understand similar emerging illnesses.
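To make the topic-modeling stage concrete, below is a minimal gensim sketch that assumes the cTAKES step has already reduced each post to a list of mapped medical concepts; the concept lists are illustrative placeholders, not data from the study.

```python
# Summarize concept mentions as latent topics with LDA (gensim).
from gensim import corpora
from gensim.models import LdaModel

# Placeholder concept lists, standing in for cTAKES output per post.
docs = [
    ["rupture", "infection", "pain", "fatigue"],
    ["silicone", "toxicity", "autoimmune_disorder"],
    ["anxiety", "depression", "fatigue", "pain"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```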


Author(s):  
Evrenii Polyakov ◽  
Leonid Voskov ◽  
Pavel Abramov ◽  
Sergey Polyakov

Introduction: Sentiment analysis is a complex problem whose solution essentially depends on the context, field of study, and amount of text data. Analysis of publications shows that authors often do not use the full range of possible data transformations and their combinations; only a part of the transformations is used, limiting the ways to develop high-quality classification models.

Purpose: Developing and exploring a generalized approach to building a model, which consists in sequentially passing through the stages of exploratory data analysis, obtaining a basic solution, vectorization, preprocessing, hyperparameter optimization, and modeling.

Results: Comparative experiments conducted using the generalized approach for classical machine learning and deep learning algorithms, in order to solve the problem of sentiment analysis of short text messages in natural language processing, have demonstrated that the classification quality grows from one stage to another. For classical algorithms, this increase in quality was insignificant, but for deep learning it was 8% on average at each stage. Additional studies have shown that automatic machine learning with classical classification algorithms is comparable in quality to manual model development, but takes much longer. The use of transfer learning has a small but positive effect on the classification quality.

Practical relevance: The proposed sequential approach can significantly improve the quality of models under development in natural language processing problems.
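For the classical-ML case, the staged approach (vectorization, preprocessing, hyperparameter optimization, modeling) might look like the following scikit-learn sketch; the toy corpus, pipeline components, and parameter grid are assumptions for illustration, not the authors' exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Placeholder corpus of short text messages with binary sentiment labels.
texts = ["great service", "terrible delay", "loved it", "awful support"]
labels = [1, 0, 1, 0]

# Vectorization and modeling stages chained in one pipeline ...
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# ... followed by the hyperparameter-optimization stage.
search = GridSearchCV(
    pipeline,
    param_grid={"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]},
    cv=2,
)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```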


Author(s):  
Seonho Kim ◽  
Jungjoon Kim ◽  
Hong-Woo Chun

Interest in research involving health and medical information analysis based on artificial intelligence, especially deep learning techniques, has recently been increasing. Most research in this field has focused on discovering new knowledge for predicting and diagnosing disease by revealing relations between diseases and various information features of data. These features are extracted by analyzing clinical pathology data, such as EHRs (electronic health records), and academic literature, using techniques of data analysis, natural language processing, etc. However, more research and interest are still needed in applying the latest advanced artificial intelligence-based data analysis techniques to bio-signal data, which are continuous physiological records such as EEG (electroencephalography) and ECG (electrocardiogram). Unlike other types of data, applying deep learning to bio-signal data, which take the form of real-valued time series, raises many issues that need to be resolved in preprocessing, learning, and analysis: feature selection is left to the user, the learning process is a black box, effective features are difficult to recognize and identify, computational complexity is high, etc.

In this paper, to address these issues, we provide an encoding-based Wave2vec time series classifier model, which combines signal processing and deep learning-based natural language processing techniques. To demonstrate its advantages, we report the results of three experiments conducted on the University of California Irvine (UCI) EEG data, a real-world benchmark bio-signal dataset. Through encoding, the bio-signals (real-valued time series in the form of waves) are converted into a sequence of symbols, or into a sequence of wavelet patterns that are then converted into symbols, and the proposed model vectorizes the symbols by learning the sequences with deep learning-based natural language processing. Models for each class can then be constructed by learning from the vectorized wavelet patterns and training data, and the implemented models can be used for prediction and diagnosis of diseases by classifying new data. The proposed method enhances data readability and the intuitiveness of feature selection and learning by converting real-valued time series into sequences of symbols, facilitating intuitive and easy recognition and identification of influential patterns. Furthermore, by drastically reducing computational complexity without degrading analysis performance through encoding-based data simplification, it facilitates the real-time, large-capacity data analysis that is essential for developing real-time analysis and diagnosis systems.
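As a rough illustration of the encoding idea, the sketch below uses a SAX-style quantile discretization as a stand-in for the paper's wavelet-pattern encoding, then embeds the resulting symbol sequences with gensim's word2vec; the synthetic signals replace the UCI EEG data, and the bin count is an arbitrary choice.

```python
import numpy as np
from gensim.models import Word2Vec

def encode(signal, n_symbols=8):
    """Quantize a z-normalized signal into discrete symbols 'a', 'b', ..."""
    z = (signal - signal.mean()) / signal.std()
    bins = np.quantile(z, np.linspace(0, 1, n_symbols + 1)[1:-1])
    return [chr(ord("a") + int(i)) for i in np.digitize(z, bins)]

# Synthetic placeholder signals standing in for the EEG recordings.
rng = np.random.default_rng(0)
signals = [np.sin(np.linspace(0, 8, 256)) + 0.1 * rng.standard_normal(256)
           for _ in range(20)]
sequences = [encode(s) for s in signals]

# Learn vector representations of the symbols from their sequence context.
model = Word2Vec(sequences, vector_size=16, window=5, min_count=1, epochs=20)
print(model.wv["a"][:4])
```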


The software development procedure begins with requirements analysis. The requirements process runs from analyzing the requirements to sketching the design of the program, which is critical work for programmers and software engineers. Moreover, many errors made during the requirements analysis cycle are transferred to later stages, making the process far more costly than initially specified. The reason is that software requirement specifications are written in natural language. To minimize these errors, the software requirements can be transferred into a computerized form as UML diagrams. To this end, a tool has been designed that provides semi-automated aid for designers, producing a UML class model from software specifications using natural language processing techniques. The proposed technique outlines the class diagram in a well-known configuration and additionally points out the relationships between classes. In this research, we propose to enhance the procedure of producing UML diagrams from natural language, which will help software developers analyze software requirements more efficiently and with fewer errors. The proposed approach uses a parser and a part-of-speech (POS) tagger to analyze the user requirements entered in English, and then extracts the verbs, phrases, etc. from the user text. The obtained results show that the proposed method performs better than other methods published in the literature, giving a better analysis of the given requirements and better diagram presentation, which can help software engineers. Key words: Part of Speech, UML
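A minimal sketch of this extraction step, assuming NLTK's POS tagger: nouns are taken as candidate classes and verbs as candidate relationships. The requirement sentence is an illustrative placeholder, not from the evaluated dataset.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

requirement = "A customer places an order and the system sends an invoice."
tagged = nltk.pos_tag(nltk.word_tokenize(requirement))

classes = [w for w, t in tagged if t.startswith("NN")]    # candidate classes
relations = [w for w, t in tagged if t.startswith("VB")]  # candidate relationships

print("candidate classes:", classes)      # e.g. ['customer', 'order', 'system', 'invoice']
print("candidate relations:", relations)  # e.g. ['places', 'sends']
```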


Author(s):  
Ayush Srivastav ◽  
Hera Khan ◽  
Amit Kumar Mishra

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models that have been used over time for Natural Language Processing tasks are discussed along with their applications. The chapter begins with the fundamental concepts of regex and tokenization. It provides an insight into text preprocessing and its methodologies, such as Stemming and Lemmatization and Stop Word Removal, followed by Part-of-Speech tagging and Named Entity Recognition. Further, the chapter elaborates the concept of Word Embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by Neural Networks and their advanced forms, such as Recursive Neural Networks and Seq2seq models, that are used in Computational Linguistics. A brief description of chatbots and Memory Networks concludes the chapter.
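To ground the preprocessing concepts the chapter opens with, here is a minimal NLTK sketch of tokenization, stop word removal, stemming, and lemmatization; NLTK is an illustrative toolkit choice, and the sentence is a placeholder.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

text = "The studies were showing that the models ran faster"
tokens = nltk.word_tokenize(text.lower())
filtered = [t for t in tokens if t not in stopwords.words("english")]

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])          # crude suffix stripping
print([lemmatizer.lemmatize(t) for t in filtered])  # dictionary-based normalization
```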


Author(s):  
Marina Sokolova ◽  
Stan Szpakowicz

This chapter presents applications of machine learning techniques to traditional problems in natural language processing, including part-of-speech tagging, entity recognition, and word-sense disambiguation. People usually solve such problems without difficulty, or at least do a very good job. Linguistics may suggest labour-intensive ways of manually constructing rule-based systems. It is, however, the easy availability of large collections of texts that has made machine learning the method of choice for processing volumes of data well beyond human capacity. One of the main purposes of text processing is all manner of information and knowledge extraction from such large text collections. The machine learning methods discussed in this chapter have stimulated wide-ranging research in natural language processing and have helped build applications with serious deployment potential.


IoT ◽  
2020 ◽  
Vol 1 (2) ◽  
pp. 218-239 ◽  
Author(s):  
Ravikumar Patel ◽  
Kalpdrum Passi

In the derived approach, an analysis is performed on Twitter data for World Cup soccer 2014, held in Brazil, to detect the sentiment of people throughout the world using machine learning techniques. By filtering and analyzing the data using natural language processing techniques, sentiment polarity was calculated based on the emotion words detected in the user tweets. The dataset is normalized for use by machine learning algorithms and prepared using natural language processing techniques such as word tokenization, stemming and lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), and parsing to extract emotions from the textual data of each tweet. The approach is implemented in the Python programming language with the Natural Language Toolkit (NLTK). A derived algorithm extracts emotional words using WordNet, with the POS of each word in a sentence that has a meaning in the current context, and assigns sentiment polarity using the SentiWordNet dictionary or a lexicon-based method. The resultant polarity is further analyzed using naïve Bayes, support vector machine (SVM), K-nearest neighbor (KNN), and random forest machine learning algorithms and visualized on the Weka platform. Naïve Bayes gives the best accuracy of 88.17%, whereas random forest gives the best area under the receiver operating characteristic curve (AUC) of 0.97.
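The lexicon-based polarity step can be sketched with NLTK's SentiWordNet interface as below; the scoring rule (averaging positive-minus-negative scores over a word's senses) is a simplifying assumption, not necessarily the derived algorithm's exact formula.

```python
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)

def word_polarity(word, wn_pos):
    """Average positive-minus-negative score over a word's senti-synsets."""
    synsets = list(swn.senti_synsets(word, wn_pos))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

print(word_polarity("happy", "a"))     # positive score for an adjective
print(word_polarity("terrible", "a"))  # negative score
```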


2021 ◽  
Vol 89 (9) ◽  
pp. S155
Author(s):  
Nicolas Nunez ◽  
Joanna M. Biernacka ◽  
Manuel Gardea-Resendez ◽  
Bhavani Singh Agnikula Kshatriya ◽  
Euijung Ryu ◽  
...  

2020 ◽  
Vol 8 (5) ◽  
pp. 1061-1068

Nowadays, people like to spend their time on social sites, especially Twitter, posting many tweets every day. The posted tweets are used by many users to gain knowledge about particular applications, products, and other search engine queries. From the posted tweets, emotions and sentiments are derived and used to obtain opinions about particular events. Many traditional sentiment detection systems have been developed, but they failed to analyze huge volumes of tweets, and online content with temporal patterns was also difficult to analyze. To overcome these issues, a co-ranking multi-modal natural language processing based sentiment analysis system was developed to detect emotions from posted tweets. Initially, tweets about different events are collected from social sites and processed with natural language procedures such as stemming, lemmatization, part-of-speech tagging, word segmentation, and parsing to obtain the words related to the posted tweets for deriving the sentiments. From the extracted emotions, a co-ranking process is applied to obtain opinions related to a particular event effectively. The efficiency of the system is then examined through experimental results and discussion. The introduced system recognizes sentiments from tweets with 98.80% accuracy.

