Text Classification by Using Natural Language Processing

2021 ◽  
Vol 1802 (4) ◽  
pp. 042010
Author(s):  
Peiyang Yu ◽  
Victor Y. Cui ◽  
Jiaxin Guan
2019 ◽  
Vol 7 ◽  
pp. 581-596
Author(s):  
Yumo Xu ◽  
Mirella Lapata

In this paper we introduce domain detection as a new natural language processing task. We argue that the ability to detect textual segments that are domain-heavy (i.e., sentences or phrases that are representative of and provide evidence for a given domain) could enhance the robustness and portability of various text classification applications. We propose an encoder-detector framework for domain detection and bootstrap classifiers with multiple instance learning. The model is hierarchically organized and suited to multilabel classification. We demonstrate that despite learning with minimal supervision, our model can be applied to text spans of different granularities, languages, and genres. We also showcase the potential of domain detection for text summarization.
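The multiple-instance-learning assumption behind this framework can be illustrated with a toy sketch: a document is a "bag" of sentence instances, and a domain label fires at the document level if any single sentence scores high enough for that domain. The keyword-overlap scorer, domain names, and threshold below are illustrative stand-ins, not the paper's encoder-detector network.

```python
# Toy sketch of max-pooled multiple-instance aggregation for
# multilabel domain detection. The scoring function is a simple
# keyword scorer standing in for a learned sentence encoder.
import re

DOMAIN_KEYWORDS = {
    "sports": {"match", "team", "goal"},
    "finance": {"stock", "market", "profit"},
}

def sentence_score(sentence: str, domain: str) -> float:
    """Fraction of the domain's keywords appearing in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    keys = DOMAIN_KEYWORDS[domain]
    return len(words & keys) / len(keys)

def document_domains(sentences, threshold=0.3):
    """Multilabel prediction: under the MIL assumption, a domain is
    assigned if ANY sentence (the max-pooled instance) clears the
    threshold, so different sentences can trigger different domains."""
    labels = []
    for domain in DOMAIN_KEYWORDS:
        if max(sentence_score(s, domain) for s in sentences) >= threshold:
            labels.append(domain)
    return labels

doc = ["The team scored a late goal to win the match.",
       "Ticket sales had little effect on the stock market."]
print(document_domains(doc))  # → ['sports', 'finance']
```

Each sentence triggers a different domain, which is exactly why max-pooling over instances supports multilabel, segment-level detection.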


2019 ◽  
Vol 140 ◽  
pp. 03006
Author(s):  
Mikhail B. Uspenskij

Modern data storage systems have a sophisticated hardware and software architecture, including multiple storage processors, storage fabrics, network equipment, and storage media, and contain information that can be damaged or lost because of hardware or software faults. The approach to storage software diagnostics presented in this paper combines log-mining algorithms for fault detection, based on natural language processing text classification methods, with a diagnostic model for the task of fault source detection. Existing approaches to computational systems diagnostics either ignore system and event log data, using only numeric monitoring parameters, target only certain log types, or use logs to create chains of structured events. The main advantage of using a natural language processing method for log text classification is that no information about log message structure, log message source, or log purpose is required, provided there is enough data to train the classifier model. The developed diagnostic procedure has an accuracy score comparable with existing methods and can target all faults present in the training set without prior research into log structure.
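The structure-agnostic idea — treat each raw log line as plain text and learn fault categories purely from training examples — can be sketched with a tiny bag-of-words Naive Bayes classifier. The log lines, fault labels, and classifier choice below are illustrative assumptions; the paper's actual model and training corpus are not specified here.

```python
# Sketch: classifying raw log lines into fault categories without any
# knowledge of log format, using bag-of-words Naive Bayes with
# Laplace smoothing.
import math
import re
from collections import Counter, defaultdict

def tokens(line):
    return re.findall(r"[a-z0-9]+", line.lower())

class LogClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-class token counts
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, lines, labels):
        for line, label in zip(lines, labels):
            self.class_counts[label] += 1
            for w in tokens(line):
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, line):
        def log_prob(label):
            total = sum(self.word_counts[label].values())
            lp = math.log(self.class_counts[label])
            for w in tokens(line):
                # Laplace smoothing: unseen tokens get pseudo-count 1
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            return lp
        return max(self.class_counts, key=log_prob)

# Hypothetical training lines and fault labels, for illustration only.
train = [
    ("disk read error on device sda", "media_fault"),
    ("sector remap failed on sda1", "media_fault"),
    ("fc link down on port 2", "fabric_fault"),
    ("fibre channel timeout on port 0", "fabric_fault"),
]
clf = LogClassifier()
clf.fit([l for l, _ in train], [c for _, c in train])
print(clf.predict("read error on device sdb"))  # → media_fault
```

Note that the classifier generalizes to the unseen device name `sdb` because the decision rests on the surrounding vocabulary, not on any parsed log structure — the property the abstract highlights.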


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Lu Zhou ◽  
Shuangqiao Liu ◽  
Caiyan Li ◽  
Yuemeng Sun ◽  
Yizhuo Zhang ◽  
...  

Background. The modernization of traditional Chinese medicine (TCM) demands systematic data mining of medical records. However, this process is hindered by the fact that many TCM symptoms have the same meaning but different literal expressions (i.e., TCM synonymous symptoms). This problem can be solved by using natural language processing algorithms to construct a high-quality TCM symptom normalization model that normalizes TCM synonymous symptoms to unified literal expressions. Methods. Four types of TCM symptom normalization models based on natural language processing were constructed to find a high-quality one: (1) a text sequence generation model based on a bidirectional long short-term memory (Bi-LSTM) neural network with an encoder-decoder structure; (2) a text classification model based on a Bi-LSTM neural network and a sigmoid function; (3) a text sequence generation model based on bidirectional encoder representations from transformers (BERT) with the sequence-to-sequence training method of the unified language model (BERT-UniLM); and (4) a text classification model based on BERT and a sigmoid function (BERT-Classification). The performance of the models was compared using four metrics: accuracy, recall, precision, and F1-score. Results. The BERT-Classification model outperformed the models based on Bi-LSTM and BERT-UniLM with respect to all four metrics. Conclusions. The BERT-Classification model has superior performance in normalizing expressions of TCM synonymous symptoms.
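The classification framing of normalization — map each synonymous surface form to one canonical symptom label, rather than generating text — can be sketched as follows. A toy character-bigram Dice similarity stands in for the paper's BERT-Classification model, and the English example strings are illustrative stand-ins for TCM symptom expressions.

```python
# Sketch: symptom normalization cast as classification — pick the
# canonical label whose known surface forms best match the input,
# here via character-bigram Dice similarity instead of BERT.

def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

# Hypothetical canonical labels and their synonymous surface forms.
CANONICAL = {
    "headache": ["head ache", "aching head", "pain in head"],
    "night sweats": ["sweating at night", "nocturnal sweating"],
}

def normalize(expression):
    """Return the canonical label whose forms share the most
    character bigrams with the input (Dice coefficient)."""
    expr = bigrams(expression.lower())
    def score(label):
        best = 0.0
        for form in CANONICAL[label] + [label]:
            f = bigrams(form)
            best = max(best, 2 * len(expr & f) / (len(expr) + len(f)))
        return best
    return max(CANONICAL, key=score)

print(normalize("sweating during the night"))  # → night sweats
```

The classification view keeps the output space closed (one of the known canonical labels), which is one plausible reason a classifier can outperform free-form sequence generation on this task.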


2019 ◽  
Author(s):  
Dimmy Magalhães ◽  
Aurora Pozo ◽  
Roberto Santana

Text classification is one of the tasks of natural language processing (NLP). In this area, Graph Convolutional Networks (GCNs) have achieved scores higher than those of CNNs and other related models. For a GCN, the metric that defines the correlation between words in a vector space plays a crucial role in classification because it determines the weight of the edge between two words (represented by nodes in the graph). In this study, we empirically investigated the impact of thirteen measures of distance/similarity. A representation was built for each document using word embeddings from a word2vec model. In addition, a graph-based representation of five datasets was created for each measure analyzed, where each word is a node in the graph and each edge is weighted by the distance/similarity between words. Finally, each model was run in a simple graph neural network. The results show that, concerning text classification, there is no statistical difference among the analyzed metrics when used with a graph convolutional network. Even with the incorporation of external words or external knowledge, the results were similar to those of the methods without such incorporation. However, the results indicate that some distance metrics capture context better than others, with Euclidean distance reaching the best values or being statistically similar to the best.
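The graph construction step the study varies — one node per word, one weighted edge per word pair, with the weight supplied by a pluggable distance or similarity metric — can be sketched briefly. The 3-d vectors below are toy values standing in for word2vec embeddings; only Euclidean distance and cosine similarity are shown out of the thirteen measures analyzed.

```python
# Sketch: building the weighted word graph for one document, with the
# edge-weight metric passed in as a function so different
# distance/similarity measures can be swapped.
import math

# Toy stand-ins for word2vec embeddings.
embeddings = {
    "film":  [0.9, 0.1, 0.2],
    "movie": [0.8, 0.2, 0.1],
    "actor": [0.3, 0.9, 0.4],
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_edges(words, metric):
    """Fully connected graph over the document's words: one weighted
    edge per unordered word pair, weight given by the chosen metric."""
    edges = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            edges[(w1, w2)] = metric(embeddings[w1], embeddings[w2])
    return edges

words = list(embeddings)
dist = build_edges(words, euclidean)
# Semantically close words get a smaller Euclidean edge weight.
print(dist[("film", "movie")] < dist[("film", "actor")])  # → True
```

Swapping `euclidean` for `cosine_sim` (or any of the other measures) changes only the edge weights, which is precisely the experimental variable the study isolates.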


Author(s):  
Martin Atzmueller

Data Mining provides approaches for the identification and discovery of non-trivial patterns and models hidden in large collections of data. In the applied natural language processing domain, data mining usually requires preprocessed data that has been extracted from textual documents. Additionally, this data is often integrated with other data sources. This chapter provides an overview on data mining focusing on approaches for pattern mining, cluster analysis, and predictive model construction. For those, we discuss exemplary techniques that are especially useful in the applied natural language processing context. Additionally, we describe how the presented data mining approaches are connected to text mining, text classification, and clustering, and discuss interesting problems and future research directions.
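One of the pattern-mining techniques such a chapter typically surveys can be made concrete with a short sketch: counting frequent term pairs across preprocessed documents, the text-mining analogue of frequent itemset mining with a minimum-support threshold. The documents and threshold below are illustrative.

```python
# Sketch: frequent term-pair mining over preprocessed (tokenized,
# deduplicated) documents — a pattern is "frequent" if its support,
# the number of documents containing it, meets min_support.
from collections import Counter
from itertools import combinations

docs = [
    {"language", "processing", "text"},
    {"language", "processing", "corpus"},
    {"text", "classification", "language"},
]

def frequent_pairs(documents, min_support=2):
    counts = Counter()
    for doc in documents:
        # sorted() gives each unordered pair one canonical key
        for pair in combinations(sorted(doc), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

print(frequent_pairs(docs))
# → {('language', 'processing'): 2, ('language', 'text'): 2}
```

Support counting like this is the building block of Apriori-style algorithms, which prune the pair (and larger itemset) search space using the same threshold.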


2020 ◽  
Vol 10 (17) ◽  
pp. 5841 ◽  
Author(s):  
Beakcheol Jang ◽  
Myeonghwi Kim ◽  
Gaspard Harerimana ◽  
Sang-ug Kang ◽  
Jong Wook Kim

There is a need to extract meaningful information from big data, classify it into different categories, and predict end-user behavior or emotions. Large amounts of data are generated from various sources such as social media and websites. Text classification is a representative research topic in the field of natural language processing (NLP) that categorizes unstructured text data into meaningful categorical classes. The long short-term memory (LSTM) model and the convolutional neural network (CNN) for sentence classification produce accurate results and have recently been used in various NLP tasks. CNN models use convolutional layers and maximum pooling or max-over-time pooling layers to extract higher-level features, while LSTM models can capture long-term dependencies between word sequences and hence are better suited for text classification. However, even with a hybrid approach that leverages the powers of these two deep-learning models, the number of features to remember for classification remains huge, hindering the training process. In this study, we propose an attention-based Bi-LSTM+CNN hybrid model that capitalizes on the advantages of LSTM and CNN with an additional attention mechanism. We trained the model on the Internet Movie Database (IMDB) movie review data to evaluate its performance, and the test results showed that the proposed hybrid attention Bi-LSTM+CNN model produces more accurate classification results, as well as higher recall and F1 scores, than individual multi-layer perceptron (MLP), CNN, or LSTM models, as well as the hybrid models.
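The attention mechanism that distinguishes this hybrid — reweighting the Bi-LSTM's per-token hidden states before further feature extraction — can be sketched in isolation. The 4-d hidden states and the dot-product scoring against a query vector below are toy stand-ins; the actual model learns these quantities end to end.

```python
# Sketch: attention pooling over per-token hidden states — softmax
# over alignment scores, then a weighted sum producing one context
# vector that emphasizes the most relevant tokens.
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Score each hidden state against the query (standing in for a
    learned attention vector), normalize with softmax, and return the
    weighted sum plus the weights themselves."""
    scores = [sum(h * q for h, q in zip(state, query))
              for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights

# Three token hidden states; the second has the largest activations.
states = [[0.1, 0.0, 0.2, 0.1],
          [0.9, 0.8, 0.7, 0.9],
          [0.2, 0.1, 0.0, 0.3]]
context, weights = attention_pool(states, [1.0, 1.0, 1.0, 1.0])
print(max(range(3), key=lambda i: weights[i]))  # → 1 (second token dominates)
```

Because attention concentrates weight on informative tokens, the downstream CNN layers see a sharpened representation — the abstract's stated remedy for the hybrid's otherwise huge feature burden.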

