Intent Identification in Unattended Customer Queries Using an Unsupervised Approach

Customer satisfaction is crucial for companies worldwide. An integrated strategy relies on omnichannel communication systems, in which chatbots are widely used. Such systems are supervised, and the key point is that the required training data are originally unlabelled. Labelling data manually is unfeasible nowadays, mainly because of the considerable volume. Moreover, customer behaviour is often hidden in the data, even for experts. This work proposes a methodology to find unknown entities and intents automatically using unsupervised learning. It is based on natural language processing (NLP) for text data preparation and on machine learning (ML) for clustering model identification. Several combinations of preprocessing, vectorisation, dimensionality reduction and clustering techniques were investigated. The case study refers to a Brazilian electric energy company, with a data set of failed customer queries, that is, queries not met by the company for any reason. They correspond to about 30% (4,044 queries) of the original data set. The best identified intent model employed stemming for preprocessing, word frequency analysis for vectorisation, latent Dirichlet allocation (LDA) for dimensionality reduction, and mini-batch k-means for clustering. This system was able to allocate 62% of the failed queries to one of the seven intents found. This newly labelled data can be used, for instance, to train NLP-based chatbots, contributing to a greater generalisation capacity and, ultimately, to increased customer satisfaction.
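The vectorisation and clustering stages described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it covers only word-frequency vectorisation and mini-batch k-means, omits the stemming and LDA stages, and uses invented example queries. The function names, batch size, and iteration count are all assumptions.

```python
import random
from collections import Counter


def vectorise(docs):
    """Turn each document into a word-frequency vector over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the centre closest to the point."""
    return min(range(len(centres)), key=lambda k: squared_distance(point, centres[k]))


def mini_batch_kmeans(vectors, k, batch_size=4, n_iter=50, seed=0):
    """Mini-batch k-means: update centres from small random batches."""
    rng = random.Random(seed)
    centres = [list(v) for v in rng.sample(vectors, k)]
    seen = [0] * k  # number of points each centre has absorbed so far
    for _ in range(n_iter):
        batch = [rng.choice(vectors) for _ in range(batch_size)]
        # Assign the whole batch first, then move the centres.
        assigned = [nearest(point, centres) for point in batch]
        for point, c in zip(batch, assigned):
            seen[c] += 1
            rate = 1.0 / seen[c]  # per-centre learning rate shrinks over time
            centres[c] = [(1 - rate) * m + rate * x for m, x in zip(centres[c], point)]
    labels = [nearest(v, centres) for v in vectors]
    return centres, labels


# Hypothetical failed queries, invented for illustration only.
queries = [
    "no power in my street",
    "power outage since morning",
    "my bill is too high",
    "wrong amount on my bill",
]
vocab, vectors = vectorise(queries)
centres, labels = mini_batch_kmeans(vectors, k=2)
print(labels)
```

Each query receives the index of its nearest centre, which plays the role of a candidate intent label. On a real corpus the vectors would first be reduced (for example with LDA, as in the abstract) before clustering.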


Deep residual detection of radio frequency interference for FAST

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz3521 ◽

2020 ◽

Vol 492 (1) ◽

pp. 1421-1431 ◽

Cited By ~ 4

Author(s):

Zhicheng Yang ◽

Ce Yu ◽

Jian Xiao ◽

Bo Zhang

Keyword(s):

Radio Frequency ◽

Large Data ◽

High Sensitivity ◽

Original Data ◽

Training Data ◽

Radio Frequency Interference ◽

Data Sets ◽

Data Set ◽

Time Required ◽

Key Steps

Radio frequency interference (RFI) detection and excision are key steps in the data-processing pipeline of the Five-hundred-meter Aperture Spherical radio Telescope (FAST). Because of its high sensitivity and large data rate, FAST requires more accurate and efficient RFI flagging methods than its counterparts. In recent decades, approaches based upon artificial intelligence (AI), such as codes using convolutional neural networks (CNNs), have been proposed to identify RFI more reliably and efficiently. However, RFI flagging of FAST data with such methods has often proved to be erroneous, with further manual inspections required. In addition, network construction as well as preparation of training data sets for effective RFI flagging has imposed significant additional workloads. Therefore, rapid deployment and adjustment of AI approaches for different observations is impractical with existing algorithms. To overcome such problems, we propose a model called RFI-Net. With raw data as input, without any preprocessing, RFI-Net can detect RFI automatically, producing corresponding masks without any alteration of the original data. Experiments with RFI-Net using simulated astronomical data show that our model outperforms existing methods in terms of both precision and recall. Moreover, compared with other models, our method can reach the same relative accuracy with less training data, thus reducing the effort and time required to prepare the training data set. Further, the training process of RFI-Net can be accelerated, with overfitting minimized, compared with other CNN codes. The performance of RFI-Net has also been evaluated with observational data obtained by FAST and the Bleien Observatory. Our results demonstrate the ability of RFI-Net to accurately identify RFI with fine-grained, high-precision masks that require no further modification.
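Precision and recall for an RFI mask can be computed pixel by pixel against a reference mask. The sketch below is illustrative only: the function name and the tiny 2x4 masks are invented, and it is not part of RFI-Net.

```python
def precision_recall(predicted, truth):
    """Pixel-wise precision and recall for binary masks given as nested lists."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1  # flagged and really RFI
            elif p and not t:
                fp += 1  # flagged but clean
            elif t and not p:
                fn += 1  # RFI that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical time-frequency masks (1 = RFI), invented for illustration.
truth = [[0, 1, 1, 0],
         [0, 0, 1, 0]]
predicted = [[0, 1, 0, 0],
             [0, 1, 1, 0]]
precision, recall = precision_recall(predicted, truth)
print(precision, recall)
```

Here two flagged pixels are correct, one is a false alarm, and one contaminated pixel is missed, so both metrics equal 2/3.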


Leveraging Natural Language Processing to Understand Public Outlook Towards the Influenza Vaccination

10.36227/techrxiv.13607258 ◽

2021 ◽

Author(s):

Ankita Agarwal ◽

William Romine ◽

Tanvi Banerjee

Keyword(s):

Social Media ◽

Language Processing ◽

Latent Dirichlet Allocation ◽

Temporal Trend ◽

Healthcare Management ◽

Data Set ◽

Vaccine Preventable Diseases ◽

The Public ◽

Social Media Platforms ◽

Flu Vaccine

Understanding public outlook in healthcare management is important in the study of various diseases. With respect to vaccinations, which play a major role in combating vaccine-preventable diseases, the study of their acceptance or rejection by the public becomes useful. For the influenza vaccine in particular, studies on public opinion and views are ongoing. Social media platforms like Twitter help us to leverage thoughts and attitudes related to the flu vaccine. The data set used for our analysis contained tweets related to vaccines, which were collected using vaccine-related keywords over a period of twelve months from February 2018 to January 2019. From these tweets, we extracted those specific to the flu vaccine and generated our corpus for further study. By using Latent Dirichlet Allocation (LDA), we identified eighteen topics comprising six major themes which best represented our corpus. In this paper, we discuss these six themes and subsequently analyze the trend observed in these themes over a period of twelve months. The themes identified covered various aspects related to the flu vaccine. Among the six major themes, four showed a distinctive temporal trend with respect to the annual flu season.


Radiomics-based prediction of hemorrhage expansion among patients with thrombolysis/thrombectomy related-hemorrhagic transformation using machine learning

Therapeutic Advances in Neurological Disorders ◽

10.1177/17562864211060029 ◽

2021 ◽

Vol 14 ◽

pp. 175628642110600

Author(s):

Junfeng Liu ◽

Wendan Tao ◽

Zhetao Wang ◽

Xinyue Chen ◽

Bo Wu ◽

...

Keyword(s):

Machine Learning ◽

Ischemic Stroke ◽

Calibration Curve ◽

Original Data ◽

Hemorrhagic Transformation ◽

Training Data ◽

Lasso Regression ◽

Clinical Value ◽

Brain Images ◽

Data Set

Introduction: Patients with hemorrhagic transformation (HT) were reported to have hemorrhage expansion. However, identification of these patients at high risk of hemorrhage expansion has not been well studied. Objectives: We aimed to develop a radiomic score to predict hemorrhage expansion after HT among patients treated with thrombolysis/thrombectomy during the acute phase of ischemic stroke. Methods: A total of 104 patients with HT after reperfusion treatment from the West China Hospital, Sichuan University, were retrospectively included in this study between 1 January 2012 and 31 December 2020. The preprocessed initial non-contrast-enhanced computed tomography (NECT) brain images were used for radiomic feature extraction. A synthetic minority oversampling technique (SMOTE) was applied to the original data set. The after-SMOTE data set was randomly split into training and testing cohorts with an 8:2 ratio by a stratified random sampling method. The least absolute shrinkage and selection operator (LASSO) regression was applied to identify candidate radiomic features and construct the radiomic score. The performance of the score was evaluated by receiver operating characteristic (ROC) analysis and a calibration curve. Decision curve analysis (DCA) was performed to evaluate the clinical value of the model. Results: Among the 104 patients, 17 patients were identified with hemorrhage expansion after HT detection. A total of 154 candidate predictors were extracted from NECT images, and five optimal features were ultimately included in the development of the radiomic score by using a logistic regression machine-learning approach. The radiomic score showed good performance with high areas under the curve in the training data set (0.91, sensitivity: 0.83; specificity: 0.89), the test data set (0.87, sensitivity: 0.60; specificity: 0.85), and the original data set (0.82, sensitivity: 0.77; specificity: 0.78). The calibration curve and DCA also indicated high accuracy and clinical usefulness of the radiomic score for hemorrhage expansion prediction after HT. Conclusions: The currently established NECT-based radiomic score is valuable in predicting hemorrhage expansion after HT among patients treated with reperfusion treatment after ischemic stroke, which may aid clinicians in determining patients with HT who are most likely to benefit from anti-expansion treatment.
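The stratified 8:2 split and the sensitivity/specificity figures quoted above can be illustrated with a short sketch. It is not the study's code: the function names, the seed, and the balanced toy labels (standing in for an oversampled data set) are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical balanced labels, as might result from oversampling.
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)

# Hypothetical predictions on seven cases, invented for illustration.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0],
                                     [1, 1, 0, 0, 0, 0, 1])
print(len(train_idx), len(test_idx), sens, spec)
```

One design point worth noting: oversampling before splitting, as the abstract describes, means synthetic samples derived from a case can land in the test cohort, which is presumably why performance on the original data set is also reported.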


A MULTIPLE CLASSIFIER SYSTEM USING AMBIGUITY REJECTION FOR CLUSTERING-CLASSIFICATION COOPERATION

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s021848850000054x ◽

2000 ◽

Vol 08 (06) ◽

pp. 747-762 ◽

Cited By ~ 2

Author(s):

VEYIS GUNES ◽

MICHEL MENARD ◽

PIERRE LOONIS

Keyword(s):

Fuzzy Clustering ◽

Feature Space ◽

Original Data ◽

Point Of View ◽

Training Data ◽

Data Set ◽

Classifier Selection ◽

Fuzzy Clustering Method ◽

Data Clusters ◽

Multiple Classifier

This article aims to show that supervised and unsupervised learning are not competing but complementary methods. We propose to use a fuzzy clustering method with ambiguity rejection to guide the supervised learning performed by Bayesian classifiers. This method detects ambiguous or mixed areas of a learning set. The problem is seen from the multi-decision point of view (i.e. several classification modules). Each classification module is specialized on a particular region of the feature space. These regions are obtained by fuzzy clustering, and their union constitutes the original data set. A classifier is associated with each cluster. The training set for each classifier is then defined on the cluster and its associated ambiguous clusters. The overall system is parallel, since different classifiers work with their own training data clusters. The algorithm makes adaptive classifier selection possible, in the sense that the fuzzy clustering with ambiguity rejection gives adapted training data regions of the feature space. The final decision is the fusion of the outputs from the most adapted classifiers.
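One way to picture ambiguity rejection is through fuzzy c-means style memberships: a point whose two largest memberships are close is treated as ambiguous and routed to both clusters. The sketch below uses the standard fuzzy c-means membership formula, but the margin-based rejection rule, the function names, and the toy centres are assumptions, not the rule used in the article.

```python
import math


def memberships(point, centres, m=2.0):
    """Fuzzy c-means memberships of a point for fixed cluster centres."""
    dists = [math.dist(point, c) for c in centres]
    if any(d == 0.0 for d in dists):
        # The point coincides with a centre: full membership there.
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** exponent for dj in dists) for di in dists]


def route(point, centres, margin=0.2):
    """Return the cluster(s) whose classifier should see this point.

    Assumed rule: if the top two memberships differ by less than
    `margin`, the point is ambiguous and belongs to both clusters.
    """
    u = memberships(point, centres)
    ranked = sorted(range(len(u)), key=lambda k: u[k], reverse=True)
    best, second = ranked[0], ranked[1]
    if u[best] - u[second] < margin:
        return [best, second]
    return [best]


# Hypothetical cluster centres in a 2D feature space.
centres = [(0.0, 0.0), (4.0, 0.0)]
print(route((0.5, 0.0), centres))  # clearly in cluster 0
print(route((2.0, 0.0), centres))  # midway between the two centres
```

A point midway between two centres gets equal memberships and is handed to both classifiers, which matches the idea that each training set covers a cluster plus its associated ambiguous regions.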


Translating Videos into Synthetic Training Data for Wearable Sensor-Based Activity Recognition Systems Using Residual Deep Convolutional Networks

Applied Sciences ◽

10.3390/app11073094 ◽

2021 ◽

Vol 11 (7) ◽

pp. 3094

Author(s):

Vitor Fortes Rey ◽

Kamalveer Kaur Garewal ◽

Paul Lukowicz

Keyword(s):

Computer Vision ◽

Regression Model ◽

Activity Recognition ◽

Language Processing ◽

Large Scale ◽

Simulated Data ◽

Training Data ◽

Sensor Data ◽

Activity Data ◽

Data Set

Human activity recognition (HAR) using wearable sensors has benefited much less from recent advances in Deep Learning than fields such as computer vision and natural language processing. This is, to a large extent, due to the lack of large scale (as compared to computer vision) repositories of labeled training data for sensor-based HAR tasks. Thus, for example, ImageNet has images for around 100,000 categories (based on WordNet) with on average 1000 images per category (therefore up to 100,000,000 samples). The Kinetics-700 video activity data set has 650,000 video clips covering 700 different human activities (in total over 1800 h). By contrast, the total length of all sensor-based HAR data sets in the popular UCI machine learning repository is less than 63 h, with around 38 of those consisting of simple mode of locomotion activities like walking, standing or cycling. In our research we aim to facilitate the use of online videos, which exist in ample quantities for most activities and are much easier to label than sensor data, to simulate labeled wearable motion sensor data. In previous work we already demonstrated some preliminary results in this direction, focusing on very simple, activity specific simulation models and a single sensor modality (acceleration norm). In this paper, we show how we can train a regression model on generic motions for both accelerometer and gyro signals and then apply it to videos of the target activities to generate synthetic Inertial Measurement Units (IMU) data (acceleration and gyro norms) that can be used to train and/or improve HAR models. We demonstrate that systems trained on simulated data generated by our regression model can come to within around 10% of the mean F1 score of a system trained on real sensor data. Furthermore, we show that by either including a small amount of real sensor data for model calibration or simply leveraging the fact that (in general) we can easily generate much more simulated data from video than we can collect its real version, the advantage of the latter can eventually be equalized.
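The acceleration and gyro norms mentioned above are the Euclidean magnitudes of the three-axis readings. A minimal sketch, with an assumed function name and invented sample values:

```python
import math


def signal_norm(samples):
    """Euclidean norm of each three-axis sample (x, y, z)."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


# Hypothetical accelerometer readings in m/s^2, invented for illustration.
acceleration = [(0.0, 0.0, 9.81), (3.0, 4.0, 0.0)]
norms = signal_norm(acceleration)
print(norms)
```

The norm discards direction, so it does not depend on how the sensor is oriented on the body, which is one plausible reason to use it as a regression target when the signal has to be inferred from video.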


LARGE‐SCALE DATA VISUALIZATION WITH MISSING VALUES

Technological and Economic Development of Economy ◽

10.3846/13928619.2006.9637721 ◽

2006 ◽

Vol 12 (1) ◽

pp. 44-49

Author(s):

Sergiy Popov

Keyword(s):

Dimensionality Reduction ◽

Large Scale ◽

Missing Values ◽

Original Data ◽

Nonlinear Dimensionality Reduction ◽

Data Sets ◽

Data Set ◽

Large Scale Data ◽

Learning Procedure ◽

Scale Data

Visualization of large‐scale data inherently requires dimensionality reduction to 1D, 2D, or 3D space. Autoassociative neural networks with a bottleneck layer are commonly used as a nonlinear dimensionality reduction technique. However, many real‐world problems suffer from incomplete data sets, i.e. some values can be missing. Common methods dealing with missing data include the deletion of all cases with missing values from the data set or replacement with mean or “normal” values for specific variables. Such methods are appropriate when just a few values are missing. But in the case when a substantial portion of data is missing, these methods can significantly bias the results of modeling. To overcome this difficulty, we propose a modified learning procedure for the autoassociative neural network that directly takes the missing values into account. The outputs of the trained network may be used for substitution of the missing values in the original data set.
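The two common baselines named above, deleting incomplete cases and replacing missing values with the column mean, can be sketched as follows. The function names and the toy data are assumptions, and missing values are represented by `None`; the proposed modified learning procedure itself is not shown.

```python
def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing value."""
    return [row for row in rows if None not in row]


def impute_column_means(rows):
    """Replace each missing value with the mean of its column's observed values.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]


# Hypothetical data set with missing entries, invented for illustration.
data = [[1.0, 2.0],
        [None, 4.0],
        [3.0, None]]
complete_only = drop_incomplete(data)
imputed = impute_column_means(data)
print(complete_only)
print(imputed)
```

With two of three rows incomplete, deletion keeps a single case and mean imputation pulls the filled values toward the centre of each column, which illustrates how both baselines can bias results when a substantial portion of the data is missing.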


Network Abnormal Data Detection Based on Deep Learning Model

CONVERTER ◽

10.17762/converter.266 ◽

2021 ◽

pp. 64-73

Author(s):

Yang Dong

Keyword(s):

Deep Learning ◽

Real Time ◽

Detection System ◽

Feature Learning ◽

Original Data ◽

Detection Algorithm ◽

Training Data ◽

Data Detection ◽

Classification Problems ◽

Data Set

Many algorithms, especially deep learning models, are used to improve the performance of intrusion detection systems (IDS). This paper presents an algorithm based on a multilayer perceptron (MLP) model, with the KDD99 data set as the training data. The original data are vectorized by one-hot encoding, the feature data are processed by Z-score normalization, and the resulting feature vectors are encoded. A multilayer perceptron network then performs feature learning, and finally a classifier model is trained for detection. Traditional network anomaly detection models mainly use manual selection methods, and their classification accuracy and efficiency are not high. This article first describes the role of the multilayer perceptron with the Adam optimizer. Testing on the KDD99 data set has been completed, and the accuracy of the algorithm can reach 99%. For future network abnormal data detection work, an algorithm model that can realize real-time online detection is provided, which will have higher accuracy and better real-time performance.
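The two preprocessing steps named above, one-hot encoding of categorical fields and Z-score scaling of numeric fields, can be sketched as follows. The function names and sample values are assumptions; the protocol names merely resemble a categorical field of the kind found in KDD99.

```python
import statistics


def one_hot(values):
    """Encode categorical values as 0/1 indicator vectors."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


def z_score(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)  # constant feature carries no information
    return [(value - mean) / std for value in values]


# Hypothetical connection records, invented for illustration.
protocols = ["tcp", "udp", "tcp", "icmp"]
durations = [2.0, 4.0, 6.0]

categories, encoded = one_hot(protocols)
scaled = z_score(durations)
print(categories, encoded)
print(scaled)
```

The encoded and scaled columns would then be concatenated into one feature vector per record before being fed to the network.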


Document Similarity Measure Based on Topic Model

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.513-517.1280 ◽

2014 ◽

Vol 513-517 ◽

pp. 1280-1284

Author(s):

Ming He ◽

Zhen Zhen Wang ◽

Yong Ping Du

Keyword(s):

Language Processing ◽

Latent Dirichlet Allocation ◽

Question Answering ◽

Topic Model ◽

Real Data ◽

Space Representation ◽

Data Set ◽

Document Similarity ◽

Fuzzy Query ◽

Document Categorization

Document similarity computation is an exciting research topic in information retrieval (IR) and a key issue for automatic document categorization, clustering analysis, fuzzy query and question answering. Topic modeling is an emerging field in natural language processing (NLP), IR and machine learning (ML). In this paper, we apply a latent Dirichlet allocation (LDA) topic model-based method to compute similarity between documents. By mapping a document from its term space representation into a topic space, a distribution over topics is derived for computing document similarity. An empirical study using a real data set demonstrates the efficiency of our method.
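Once each document is represented as a distribution over topics, similarity reduces to comparing two probability vectors. The abstract does not state which measure is used, so the cosine similarity and Jensen-Shannon divergence below are assumptions, as are the function names and the invented topic distributions.

```python
import math


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical topic distributions over three topics, invented for illustration.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
print(js_divergence(doc_a, doc_b), js_divergence(doc_a, doc_c))
```

Documents dominated by the same topic score a higher cosine similarity and a lower divergence than documents dominated by different topics, even when they share few literal terms.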


Analyzing U.S. Army Officer Evaluation Reports with Natural Language Processing: A Log-Odds and Latent Dirichlet Allocation Exploration

Industrial and Systems Engineering Review ◽

10.37266/iser.2019v7i1.pp44-55 ◽

2019 ◽

Vol 7 (1) ◽

pp. 44-55

Author(s):

Heidy Shi ◽

John Caddell ◽

Julia Lensing

Keyword(s):

Natural Language Processing ◽

Text Mining ◽

Language Processing ◽

Text Analysis ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Data Set ◽

Army Officer ◽

Log Odds ◽

Dirichlet Allocation

Each job field (branch) in the Army requires a unique set of skills and talents of the officers assigned. Officers who demonstrate the required skills are often more successful in their assigned branch. To better understand how success is described across branches, research was conducted using text mining and text analysis of a data set of Officer Evaluation Reports (OERs). This research looked for common trends and discrepancies across varying branches and like groups of branches by analyzing the narrative portion of OERs. Text analysis methods examined words and bigrams commonly used to describe varying degrees of performance by officers. Topic modeling using Latent Dirichlet Allocation (LDA) was also conducted on top rated narratives to investigate trends and discrepancies in clustering narratives. Findings show that qualitative narratives for the top two performance designations fail to differentiate between officers’ varying levels of performance regardless of branch.
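A log-odds comparison highlights words that one group of narratives uses more than another. The sketch below is a simplified, smoothed log-odds ratio and is not necessarily the weighting used in the study; the function name, the smoothing constant, and the toy narratives are all assumptions.

```python
import math
from collections import Counter


def log_odds_ratio(tokens_a, tokens_b, alpha=0.5):
    """Smoothed log-odds ratio of each word in group A versus group B.

    Positive scores mark words more typical of A, negative of B.
    `alpha` is an assumed smoothing constant that avoids division by zero.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {}
    for word in set(counts_a) | set(counts_b):
        odds_a = (counts_a[word] + alpha) / (total_a - counts_a[word] + alpha)
        odds_b = (counts_b[word] + alpha) / (total_b - counts_b[word] + alpha)
        scores[word] = math.log(odds_a / odds_b)
    return scores


# Hypothetical narrative snippets, invented for illustration.
top_rated = "outstanding leader outstanding officer".split()
second_tier = "solid officer solid performer".split()
scores = log_odds_ratio(top_rated, second_tier)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

A word used equally often in both groups, such as "officer" here, scores zero, which gives a concrete picture of the finding that narratives for the top two designations are hard to tell apart.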
