topic distribution
Recently Published Documents


TOTAL DOCUMENTS: 68 (five years: 39)
H-INDEX: 5 (five years: 2)

2022 ◽  
Vol 40 (1) ◽  
pp. 1-29
Author(s):  
Siqing Li ◽  
Yaliang Li ◽  
Wayne Xin Zhao ◽  
Bolin Ding ◽  
Ji-Rong Wen

Citation count prediction is an important task for estimating the future impact of research papers. Most existing works utilize only information extracted from the paper itself. In this article, we focus on how to exploit another useful data signal, peer review text, to improve both the performance and the interpretability of prediction models. Specifically, we propose a novel aspect-aware capsule network for citation count prediction based on review text. It contains two major capsule layers, the feature capsule layer and the aspect capsule layer, each with its own routing approach. Feature capsules encode the local semantics of review sentences as the input to the aspect capsule layer, whereas aspect capsules capture high-level semantic features that serve as the final representations for prediction. Beyond predictive capacity, we also enhance model interpretability with two strategies. First, we use the topic distribution of the review text to guide the learning of aspect capsules so that each aspect capsule represents a specific aspect of the review. Second, we use the learned aspect capsules to generate readable text that explains the predicted citation count. Extensive experiments on two real-world datasets demonstrate the effectiveness of the proposed model in both performance and interpretability.
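The topic-guided learning of aspect capsules can be pictured with a minimal numpy sketch (all names and values here are illustrative, not the authors' code): the LDA topic distribution of a review serves as a soft target for the aspect-capsule activations via a KL-divergence guidance term.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Hypothetical lengths of 4 aspect capsules (capsule length ~ activation strength)
aspect_lengths = np.array([2.0, 0.5, 1.0, 0.1])
aspect_dist = softmax(aspect_lengths)

# LDA topic distribution of the review text guides the aspect activations
topic_dist = np.array([0.6, 0.1, 0.25, 0.05])

# A guidance loss of this shape pushes each aspect capsule toward one review aspect
guidance_loss = kl_divergence(topic_dist, aspect_dist)
```

Minimizing a term like `guidance_loss` alongside the prediction loss is one plausible way to align capsules with topics; the paper's exact objective may differ.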


2022 ◽  
Vol 40 (1) ◽  
pp. 1-24
Author(s):  
Seyed Ali Bahrainian ◽  
George Zerveas ◽  
Fabio Crestani ◽  
Carsten Eickhoff

Neural sequence-to-sequence models are the state-of-the-art approach to abstractive summarization of textual documents, producing condensed versions of source narratives without being restricted to words from the original text. Despite advances in abstractive summarization, customized generation of summaries (e.g., toward a user's preference) remains unexplored. In this article, we present CATS, an abstractive neural summarization model that summarizes content in a sequence-to-sequence fashion while introducing a new mechanism to control the underlying latent topic distribution of the produced summaries. We empirically illustrate the efficacy of our model in producing customized summaries and present findings that facilitate the design of such systems. We use the well-known CNN/DailyMail dataset to evaluate our model. Furthermore, we present a transfer-learning method and demonstrate the effectiveness of our approach in a low-resource setting, i.e., abstractive summarization of meeting minutes, where combining the main available meeting transcript datasets, AMI and the International Computer Science Institute (ICSI) corpus, yields merely a few hundred training documents.
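One simple way to picture such topic control (a hedged sketch, not the CATS mechanism itself) is to interpolate the decoder's next-token distribution with a topic-word distribution; the mixing weight `gamma` is an assumed parameter that steers generation toward the chosen topic.

```python
import numpy as np

# Hypothetical vocabulary of 5 tokens: the decoder's next-token probabilities
decoder_probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])

# Word distribution for the user-selected topic (a row of an LDA-style phi matrix)
topic_word_probs = np.array([0.05, 0.10, 0.50, 0.30, 0.05])

def topic_controlled_probs(decoder_probs, topic_word_probs, gamma=0.5):
    """Mix decoder and topic distributions; larger gamma biases toward the topic."""
    mixed = (1 - gamma) * decoder_probs + gamma * topic_word_probs
    return mixed / mixed.sum()

controlled = topic_controlled_probs(decoder_probs, topic_word_probs, gamma=0.7)
```

With `gamma=0.7` the most likely token shifts from the decoder's favorite (index 0) to the topic's favorite (index 2), illustrating how a latent-topic signal can reshape the summary's wording.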


2021 ◽  
pp. 1-13
Author(s):  
Dangguo Shao ◽  
Chengyao Li ◽  
Chusheng Huang ◽  
Qing An ◽  
Yan Xiang ◽  
...  

Aiming at the low effectiveness of feature extraction for short texts, this paper proposes a short-text classification model based on an improved Wasserstein Latent Dirichlet Allocation (W-LDA), a neural topic model built on the Wasserstein Auto-Encoder (WAE) framework. The improvements to W-LDA are as follows. First, the Bag-of-Words (BOW) input to W-LDA is preprocessed with Term Frequency-Inverse Document Frequency (TF-IDF) weighting. Second, the prior distribution over latent topics is changed from a Dirichlet distribution to a Gaussian mixture distribution, based on variational Bayesian inference. Third, a sparsemax layer is introduced after the hidden layer inferred by the encoder network to generate a sparse document-topic distribution with better topic relevance; the improved W-LDA is named the Sparse Wasserstein Variational Bayesian Gaussian Mixture Model (SW-VBGMM). Finally, the document-topic distribution generated by SW-VBGMM is fed into a BiGRU (Bidirectional Gated Recurrent Unit) for deep feature extraction and short-text classification. Experiments on three Chinese short-text datasets and one English dataset show that our model outperforms several common topic models and neural network models on four text classification metrics (accuracy, precision, recall, and F1 score).
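The sparsemax layer mentioned above has a closed-form solution (Martins & Astudillo, 2016) that can be sketched in a few lines of numpy; unlike softmax, it can assign exactly zero weight to irrelevant topics, which is what yields the sparse document-topic distribution.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of scores z onto the probability simplex.
    Weak entries are clipped to exactly zero, unlike softmax."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum          # entries kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z        # threshold
    return np.maximum(z - tau, 0.0)

# A document's raw topic scores: sparsemax zeroes out the two weak topics
scores = np.array([2.0, 1.5, 0.1, -0.5])
doc_topic = sparsemax(scores)                    # [0.75, 0.25, 0.0, 0.0]
```

The output still sums to one, so it can be used directly as a document-topic distribution downstream (here, as BiGRU input features).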


2021 ◽  
Author(s):  
Shaoke Lou ◽  
Tianxiao Li ◽  
Mark Gerstein

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has caused millions of deaths worldwide. Many efforts have focused on unraveling the mechanism of viral infection to develop effective strategies for treatment and prevention. Previous studies have provided some clarity on the protein-protein interaction linkages occurring during the life cycle of viral infection; however, we lack a complete understanding of the full interactome, comprising human miRNAs, protein-coding genes, and co-infecting microbes. To determine this comprehensively, we developed a statistical modeling method using latent Dirichlet allocation (called MLCrosstalk, for multiple-layer crosstalk) that fuses many types of data to construct the full interactome of SARS-CoV-2. Specifically, MLCrosstalk is able to integrate samples with multiple layers of information (e.g., miRNA and microbes), enforce a consistent topic distribution across all data types, and infer individual-level linkages (i.e., differing between patients). We also implement a secondary refinement with network propagation to allow our microbe-gene linkages to address larger network structures (e.g., pathways). Using MLCrosstalk, we generated a list of genes and microbes linked to SARS-CoV-2. Interestingly, two of the identified microbes, Rothia mucilaginosa and Prevotella melaninogenica, show distinct patterns representing synergistic and antagonistic relationships with the virus, respectively. We also identified several SARS-CoV-2-associated pathways, including the VEGFA-VEGFR2 and immune response pathways, which may provide potential targets for drug design.
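The network-propagation refinement can be sketched as a random walk with restart over a toy gene network; the adjacency matrix and seed below are illustrative, not the paper's data.

```python
import numpy as np

def propagate(adjacency, seeds, restart=0.5, tol=1e-9, max_iter=1000):
    """Random-walk-with-restart propagation: spread seed scores (e.g., initial
    microbe-gene linkages) over a network so pathway neighbours also get credit."""
    A = np.asarray(adjacency, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)          # column-stochastic transitions
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

# Toy 4-node gene network; node 0 carries the seeded linkage score
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
scores = propagate(A, seeds=[1, 0, 0, 0])
```

After propagation the seed node keeps the largest score while its neighbours inherit part of the signal, which is how isolated linkages get contextualized within pathways.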


2021 ◽  
Vol 6 (1) ◽  
Author(s):  
Bojan Evkoski ◽  
Nikola Ljubešić ◽  
Andraž Pelicon ◽  
Igor Mozetič ◽  
Petra Kralj Novak

Twitter data exhibits several dimensions worth exploring: a network dimension in the form of links between users, the textual content of the tweets posted, and a temporal dimension in the time-stamped sequence of tweets and their retweets. In this paper, we combine analyses along all three dimensions: the temporal evolution of retweet networks and communities, content in terms of hate speech, and discussion topics. We apply the methods to a comprehensive set of all Slovenian tweets collected in the years 2018–2020. We find that politics and ideology are the prevailing topics despite the emergence of the Covid-19 pandemic. These two topics also attract the highest proportion of unacceptable tweets. Over time, the membership of retweet communities changes, but their topic distribution remains remarkably stable. Some retweet communities are strongly linked by external retweet influence and form super-communities. Super-community membership closely corresponds to topic distribution: communities from the same super-community have very similar topic distributions, while communities from different super-communities discuss quite different topics. However, we also find that even communities from the same super-community differ considerably in the proportion of unacceptable tweets they post.
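Topic-distribution similarity between communities, of the kind underlying the super-community observation, can be measured with the Jensen-Shannon divergence; the community vectors below are hypothetical.

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Symmetric divergence between two topic distributions (0 = identical)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic shares (politics, covid, sport) for three retweet communities
comm_a = [0.70, 0.20, 0.10]   # same super-community as comm_b
comm_b = [0.65, 0.25, 0.10]
comm_c = [0.10, 0.20, 0.70]   # different super-community

within = jensen_shannon(comm_a, comm_b)   # small: similar topic mix
across = jensen_shannon(comm_a, comm_c)   # large: different topic mix
```

A low within-super-community divergence and a high across-super-community divergence is exactly the pattern the abstract describes.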


2021 ◽  
Vol 8 (6) ◽  
pp. 1265
Author(s):  
Muhammad Alkaff ◽  
Andreyan Rizky Baskara ◽  
Irham Maulani

A service system named Lapor! conveys aspirations and complaints from the public about Indonesian government services. The government has long used this system to address problems of the Indonesian people related to bureaucracy. However, the increasing volume of reports, which operators sort by reading every complaint that comes through the system, causes frequent errors in which operators forward reports to the wrong agencies. Therefore, we need a solution that can automatically determine a report's context using Natural Language Processing techniques. This study aims to build an automatic report classifier, based on report topics, that routes reports to the authorized agencies by combining Latent Dirichlet Allocation (LDA) and a Support Vector Machine (SVM). Topic modeling for each report is carried out using the LDA method, which extracts reports to find specific patterns in documents and outputs topic distribution values. The classification step that determines a report's destination agency is then carried out using the SVM, based on the topic values extracted by the LDA method. The LDA-SVM model's performance is measured using a confusion matrix, calculating accuracy, precision, recall, and F1 score. Test results using a 70:30 train-test split show that the model performs well, with 79.85% accuracy, 79.98% precision, 72.37% recall, and a 74.67% F1 score.
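The confusion-matrix evaluation can be reproduced with a small self-contained helper (a sketch with toy labels, not the study's data) that computes accuracy and macro-averaged precision, recall, and F1.

```python
def classification_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall, and F1 from a confusion
    matrix, as used to evaluate a report classifier like LDA-SVM."""
    cm = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    accuracy = sum(cm[l][l] for l in labels) / len(y_true)
    precisions, recalls = [], []
    for l in labels:
        tp = cm[l][l]
        fp = sum(cm[t][l] for t in labels if t != l)   # predicted l, actually other
        fn = sum(cm[l][p] for p in labels if p != l)   # actually l, predicted other
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy routing example: two hypothetical destination agencies
acc, prec, rec, f1 = classification_metrics(
    ["health", "health", "roads", "roads"],
    ["health", "roads", "roads", "roads"],
    ["health", "roads"])
```

Note that recall can lag accuracy and precision when one class is under-predicted, which matches the gap between the reported 79.85% accuracy and 72.37% recall.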


2021 ◽  
Vol 32 (3) ◽  
pp. 46-68
Author(s):  
Fei Liu ◽  
Meiyun Zuo

The COVID-19 pandemic is an ongoing global pandemic that has caused worldwide social and economic disruption. In addition to physical illness, people must endure the psychological intrusion of rumors. It is therefore critical to characterize the accompanying infodemic, a significant part of COVID-19, to help defeat the epidemic. This article mines the topic distribution and evolution patterns of online rumors by comparing and contrasting COVID-19 rumors from the two most popular rumor-refuting platforms, Jiaozhen in China and Full Fact in the United Kingdom (UK), via a novel topic mining model: text clustering based on Bidirectional Encoder Representations from Transformers (BERT), combined with lifecycle theory. This spatio-temporal comparison enriches infodemiology research and provides practical guidance for governments, rumor-refuting platforms, and individuals. The comparative study highlights the similarities and differences of online rumors about global public health emergencies across countries.


2021 ◽  
Vol 13 (4) ◽  
pp. 40-56
Author(s):  
Jiaohua Qin ◽  
Zhuo Zhou ◽  
Yun Tan ◽  
Xuyu Xiang ◽  
Zhibin He

Coverless information hiding has become a hot topic in recent years. Existing steganalysis tools are ineffective against coverless steganography because it makes no modification to the carrier. However, coverless text steganography has a relatively low hiding capacity, so this paper proposes a big-data coverless text information hiding method based on LDA (Latent Dirichlet Allocation) topic distribution and keyword TF-IDF (term frequency-inverse document frequency). First, the sender and receiver build a shared codebook covering word segmentation, word frequency and TF-IDF features, and LDA topic-model clustering. The sender then splits the secret information, converts it into keyword IDs through the keyword-index table, and searches for texts containing the secret-information keywords. Second, each retrieved text is assigned an index tag according to its topic distribution and TF-IDF features. At the same time, random numbers are introduced to control the keyword order of the secret information.
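The codebook-and-search step can be sketched in plain Python; the keyword-index table and corpus below are invented for illustration, and a real system would derive them from a large shared corpus.

```python
# Secret message is split into keywords
secret = "launch at dawn"
keywords = secret.split()

# Sender and receiver build the same keyword-index table from a shared corpus
# (IDs here are arbitrary placeholders)
keyword_index = {"launch": 17, "at": 3, "dawn": 42}

# The secret becomes a sequence of keyword IDs, never embedded in any carrier
ids = [keyword_index[w] for w in keywords]

# Sender searches a big-data text corpus for natural texts that already
# contain each keyword, so no carrier is ever modified
corpus = [
    "the rocket launch was delayed",
    "we met at the old cafe",
    "mist settles before dawn on the river",
]

def find_carrier(keyword, corpus):
    """Return the first corpus text that already contains the keyword as-is."""
    for text in corpus:
        if keyword in text.split():
            return text
    return None

carriers = [find_carrier(w, corpus) for w in keywords]
```

In the full method, the chosen texts are additionally indexed by their LDA topic distribution and TF-IDF features, and random numbers scramble the keyword order; this sketch only shows the unmodified-carrier lookup that defeats modification-based steganalysis.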


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Xiujuan Wang ◽  
Yi Sui ◽  
Yuanrui Tao ◽  
Qianqian Zhang ◽  
Jianhua Wei

With the rapid development of the Internet since the beginning of the 21st century, social networks have provided significant convenience for work, study, and entertainment. Because of the irreplaceable role of social platforms in disseminating information, criminals have updated the main methods of social engineering attacks accordingly. Detecting abnormal accounts on social networks in a timely manner can effectively prevent malicious Internet events. Unlike previous research, this work proposes an anomaly detection method called Hurst of Interest Distribution, based on the stability of user interest as quantified from the content of users' tweets, to detect abnormal accounts. In detail, the Latent Dirichlet Allocation model is adopted to classify Twitter posts into topics and obtain the topic distribution of tweets sent by a single user within a period of time. The stability of the user's tweet topic preference is then measured with the Hurst index to determine whether the account is compromised. Experiments show that the Hurst indexes of normal and abnormal accounts differ significantly, and the detection rate of abnormal accounts using the proposed method reaches up to 97.93%.
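A simple rescaled-range (R/S) estimate of the Hurst index, of the kind used to score topic-preference stability, can be sketched as follows; this is a generic implementation, not the authors' exact procedure.

```python
import numpy as np

def hurst_rs(series):
    """Estimate the Hurst exponent with rescaled-range (R/S) analysis.
    Values near 0.5 suggest a random walk; higher values suggest persistence."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    sizes = [s for s in (8, 16, 32, 64, 128) if s <= n // 2]
    log_size, log_rs = [], []
    for size in sizes:
        rs_vals = []
        for start in range(0, n - size + 1, size):
            window = series[start:start + size]
            dev = np.cumsum(window - window.mean())   # cumulative deviations
            r = dev.max() - dev.min()                 # range
            s = window.std()                          # scale
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            log_size.append(np.log(size))
            log_rs.append(np.log(np.mean(rs_vals)))
    slope, _ = np.polyfit(log_size, log_rs, 1)        # H is the log-log slope
    return slope

rng = np.random.default_rng(0)
noise_h = hurst_rs(rng.standard_normal(512))              # white noise: H near 0.5
walk_h = hurst_rs(np.cumsum(rng.standard_normal(512)))    # persistent walk: higher H
```

Applied to a per-user time series of topic-preference values, a stable (persistent) series yields a higher estimate than an erratic one, which is the signal used to separate normal from compromised accounts.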

