A Hybrid Ensemble Word Embedding based Classification Model for Multi-document Summarization Process on Large Multi-domain Document Sets

Multi-document summarization transforms a set of related documents into one concise summary. Existing summarizers for Indonesian news articles do not take relationships between sentences into account and depend heavily on Indonesian language tools and resources. In this paper, we employ a Graph Convolutional Network (GCN), which accepts a word embedding sequence and a sentence relationship graph as input, for Indonesian news article summarization. Our system comprises four main components: preprocessing, graph construction, sentence scoring, and sentence selection. The sentence scoring component is a neural network that uses a Recurrent Neural Network (RNN) and a GCN to produce scores for all sentences. We use three different representation types for the sentence relationship graph. The sentence selection component then generates the summary with one of two techniques: greedily choosing the sentences with the highest scores, or Maximum Marginal Relevance (MMR). The evaluation shows that the GCN summarizer with the Personalized Discourse Graph (PDG) representation achieves the best results, with an average ROUGE-2 recall of 0.370 for 100-word summaries and 0.378 for 200-word summaries. Greedy sentence selection gives better results for 100-word summaries, while MMR performs better for 200-word summaries.
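The two sentence selection techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bag-of-words cosine used as the redundancy measure, and the trade-off parameter `lam` are all assumptions made for illustration. The sentence scores are assumed to come from some upstream scorer (the RNN and GCN network in the paper). Setting `lam=1.0` ignores redundancy and reduces to greedy selection by score.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_mmr(sentences, scores, max_words, lam=0.7):
    """Return indices of selected sentences, in selection order.

    Hypothetical helper: each round picks the candidate maximizing
    lam * score - (1 - lam) * (max similarity to already chosen sentences),
    and keeps it only if it still fits in the word budget.
    """
    bows = [Counter(s.lower().split()) for s in sentences]
    chosen, used = [], 0
    candidates = list(range(len(sentences)))

    while candidates:
        def mmr(i):
            redundancy = max((cosine(bows[i], bows[j]) for j in chosen), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr)  # ties go to the lowest index
        candidates.remove(best)
        length = len(sentences[best].split())
        if used + length <= max_words:
            chosen.append(best)
            used += length
    return chosen


sentences = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "dogs bark loudly at night",
]
scores = [0.9, 0.85, 0.5]  # made-up scores standing in for model output

greedy_pick = select_mmr(sentences, scores, max_words=13, lam=1.0)
mmr_pick = select_mmr(sentences, scores, max_words=13, lam=0.5)
```

On this toy input, greedy selection keeps the two near-duplicate sentences because they score highest, while the MMR variant swaps the second one for the unrelated sentence.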


A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter

Electronics, 2020, Vol 9 (9), pp. 1527. DOI: 10.3390/electronics9091527. Cited by ~ 1

Author(s): Han-Sub Shin, Hyuk-Yoon Kwon, Seung-Jin Ryu

Keyword(s): Deep Learning, Text Classification, Area Under The Curve, Word Embedding, Classification Model, Data Set, Feature Vectors, Model Based, Proposed Model, The Difference

Detecting cybersecurity intelligence (CSI) on social media such as Twitter is crucial because it allows security experts to respond to cyber threats in advance. In this paper, we devise a new text classification model based on deep learning to classify tweets in a collection as CSI-positive or CSI-negative. For this, we propose a novel word embedding model, called contrastive word embedding, that makes it possible to maximize the difference between base embedding models. First, we define CSI-positive and CSI-negative corpora, which are used for constructing the embedding models. Here, to compensate for the imbalance of the tweet data sets, we additionally employ background knowledge for each tweet corpus: (1) the CVE data set for the CSI-positive corpus and (2) the Wikitext data set for the CSI-negative corpus. Second, we adopt deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and integrate the feature vectors into one classifier. To validate the effectiveness of the proposed model, we compare our method with two baseline classification models: (1) a model based on a single embedding model constructed with the CSI-positive corpus only and (2) another model constructed with the CSI-negative corpus only. The results show that the proposed model achieves high accuracy, with an F1-score of 0.934 and an area under the curve (AUC) of 0.935, improving on the baseline models by 1.76 to 6.74% in F1-score and by 1.64 to 6.98% in AUC.
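The idea of extracting feature vectors from two separately built embedding models and integrating them into one classifier input can be sketched as follows. This is only an illustration under loud assumptions: mean pooling stands in for the CNN or LSTM feature extractors mentioned in the abstract, the tiny embedding tables are made up, and `mean_vector` and `dual_features` are hypothetical names rather than anything from the paper.

```python
def mean_vector(tokens, embedding, dim):
    """Average the vectors of tokens found in `embedding`; zeros if none are found."""
    vecs = [embedding[t] for t in tokens if t in embedding]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def dual_features(tweet, pos_emb, neg_emb, dim):
    """Concatenate features from the CSI-positive and CSI-negative embedding tables.

    Assumption: a downstream classifier would consume this joint vector.
    """
    tokens = tweet.lower().split()
    return mean_vector(tokens, pos_emb, dim) + mean_vector(tokens, neg_emb, dim)


# Toy 2-dimensional tables standing in for embeddings trained on each corpus.
pos_emb = {"exploit": [1.0, 0.0], "patch": [0.0, 1.0]}
neg_emb = {"coffee": [1.0, 1.0]}

features = dual_features("Exploit patch released", pos_emb, neg_emb, dim=2)
```

In this toy case the tweet's words appear only in the CSI-positive table, so the first half of the joint vector is populated and the second half is zero, which is the kind of contrast a classifier could pick up on.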
