topic distribution
Recently Published Documents


TOTAL DOCUMENTS: 68 (five years: 39)
H-INDEX: 5 (five years: 2)

2022 ◽  
Vol 40 (1) ◽  
pp. 1-29
Author(s):  
Siqing Li ◽  
Yaliang Li ◽  
Wayne Xin Zhao ◽  
Bolin Ding ◽  
Ji-Rong Wen

Citation count prediction is an important task for estimating the future impact of research papers. Most existing works utilize only information extracted from the paper itself. In this article, we focus on how to exploit another useful data signal, peer review text, to improve both the performance and the interpretability of prediction models. Specifically, we propose a novel aspect-aware capsule network for citation count prediction based on review text. It contains two major capsule layers, the feature capsule layer and the aspect capsule layer, each with its own routing approach. Feature capsules encode the local semantics of review sentences as the input to the aspect capsule layer, whereas aspect capsules capture high-level semantic features that serve as the final representations for prediction. Beyond predictive capacity, we also enhance model interpretability with two strategies. First, we use the topic distribution of the review text to guide the learning of aspect capsules so that each aspect capsule represents a specific aspect of the review. Second, we use the learned aspect capsules to generate readable text that explains the predicted citation count. Extensive experiments on two real-world datasets demonstrate the effectiveness of the proposed model in both performance and interpretability.
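The topic-guided learning of aspect capsules can be pictured with a minimal numpy sketch (all names and values here are illustrative, not the authors' code): the LDA topic distribution of a review serves as a soft target for the aspect-capsule activations via a KL-divergence guidance term.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Hypothetical lengths of 4 aspect capsules (capsule length ~ activation strength)
aspect_lengths = np.array([2.0, 0.5, 1.0, 0.1])
aspect_dist = softmax(aspect_lengths)

# LDA topic distribution of the review text guides the aspect activations
topic_dist = np.array([0.6, 0.1, 0.25, 0.05])

# A guidance loss of this shape pushes each aspect capsule toward one review aspect
guidance_loss = kl_divergence(topic_dist, aspect_dist)
```

Minimizing a term like `guidance_loss` alongside the prediction loss is one plausible way to align capsules with topics; the paper's exact objective may differ.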


2022 ◽  
Vol 40 (1) ◽  
pp. 1-24
Author(s):  
Seyed Ali Bahrainian ◽  
George Zerveas ◽  
Fabio Crestani ◽  
Carsten Eickhoff

Neural sequence-to-sequence models are the state-of-the-art approach to abstractive summarization of textual documents, producing condensed versions of source narratives without being restricted to words from the original text. Despite advances in abstractive summarization, customized generation of summaries (e.g., toward a user's preference) remains unexplored. In this article, we present CATS, an abstractive neural summarization model that summarizes content in a sequence-to-sequence fashion while introducing a new mechanism to control the underlying latent topic distribution of the produced summaries. We empirically illustrate the efficacy of our model in producing customized summaries and present findings that facilitate the design of such systems. We use the well-known CNN/DailyMail dataset to evaluate our model. Furthermore, we present a transfer-learning method and demonstrate the effectiveness of our approach in a low-resource setting, i.e., abstractive summarization of meeting minutes, where combining the main available meeting transcript datasets, AMI and the International Computer Science Institute (ICSI) corpus, yields merely a few hundred training documents.
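One simple way to picture such topic control (a hedged sketch, not the CATS mechanism itself) is to interpolate the decoder's next-token distribution with a topic-word distribution; the mixing weight `gamma` is an assumed parameter that steers generation toward the chosen topic.

```python
import numpy as np

# Hypothetical vocabulary of 5 tokens: the decoder's next-token probabilities
decoder_probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])

# Word distribution for the user-selected topic (a row of an LDA-style phi matrix)
topic_word_probs = np.array([0.05, 0.10, 0.50, 0.30, 0.05])

def topic_controlled_probs(decoder_probs, topic_word_probs, gamma=0.5):
    """Mix decoder and topic distributions; larger gamma biases toward the topic."""
    mixed = (1 - gamma) * decoder_probs + gamma * topic_word_probs
    return mixed / mixed.sum()

controlled = topic_controlled_probs(decoder_probs, topic_word_probs, gamma=0.7)
```

With `gamma=0.7` the most likely token shifts from the decoder's favorite (index 0) to the topic's favorite (index 2), illustrating how a latent-topic signal can reshape the summary's wording.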


2021 ◽  
pp. 1-13
Author(s):  
Dangguo Shao ◽  
Chengyao Li ◽  
Chusheng Huang ◽  
Qing An ◽  
Yan Xiang ◽  
...  

Aiming at the low effectiveness of feature extraction for short texts, this paper proposes a short-text classification model based on an improved Wasserstein Latent Dirichlet Allocation (W-LDA), a neural topic model built on the Wasserstein Auto-Encoder (WAE) framework. The improvements to W-LDA are as follows. First, the Bag-of-Words (BOW) input to W-LDA is preprocessed with Term Frequency-Inverse Document Frequency (TF-IDF) weighting. Second, the prior distribution over latent topics is changed from a Dirichlet distribution to a Gaussian mixture distribution, based on variational Bayesian inference. Third, a sparsemax layer is introduced after the hidden layer inferred by the encoder network to generate a sparse document-topic distribution with better topic relevance; the improved W-LDA is named the Sparse Wasserstein Variational Bayesian Gaussian Mixture Model (SW-VBGMM). Finally, the document-topic distribution generated by SW-VBGMM is fed into a BiGRU (Bidirectional Gated Recurrent Unit) for deep feature extraction and short-text classification. Experiments on three Chinese short-text datasets and one English dataset show that our model outperforms several common topic models and neural network models on four text classification metrics (accuracy, precision, recall, and F1 score).
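The sparsemax layer mentioned above has a closed-form solution (Martins & Astudillo, 2016) that can be sketched in a few lines of numpy; unlike softmax, it can assign exactly zero weight to irrelevant topics, which is what yields the sparse document-topic distribution.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of scores z onto the probability simplex.
    Weak entries are clipped to exactly zero, unlike softmax."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum          # entries kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z        # threshold
    return np.maximum(z - tau, 0.0)

# A document's raw topic scores: sparsemax zeroes out the two weak topics
scores = np.array([2.0, 1.5, 0.1, -0.5])
doc_topic = sparsemax(scores)                    # [0.75, 0.25, 0.0, 0.0]
```

The output still sums to one, so it can be used directly as a document-topic distribution downstream (here, as BiGRU input features).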


2021 ◽  
Author(s):  
Shaoke Lou ◽  
Tianxiao Li ◽  
Mark Gerstein

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has caused millions of deaths worldwide. Many efforts have focused on unraveling the mechanism of viral infection to develop effective strategies for treatment and prevention. Previous studies have provided some clarity on the protein-protein interaction linkages occurring during the life cycle of viral infection; however, we lack a complete understanding of the full interactome, comprising human miRNAs, protein-coding genes, and co-infecting microbes. To determine this comprehensively, we developed a statistical modeling method using latent Dirichlet allocation (called MLCrosstalk, for multiple-layer crosstalk) that fuses many types of data to construct the full interactome of SARS-CoV-2. Specifically, MLCrosstalk is able to integrate samples with multiple layers of information (e.g., miRNA and microbes), enforce a consistent topic distribution across all data types, and infer individual-level linkages (i.e., differing between patients). We also implement a secondary refinement with network propagation to allow our microbe-gene linkages to address larger network structures (e.g., pathways). Using MLCrosstalk, we generated a list of genes and microbes linked to SARS-CoV-2. Interestingly, two of the identified microbes, Rothia mucilaginosa and Prevotella melaninogenica, show distinct patterns representing synergistic and antagonistic relationships with the virus, respectively. We also identified several SARS-CoV-2-associated pathways, including the VEGFA-VEGFR2 and immune response pathways, which may provide potential targets for drug design.
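The network-propagation refinement can be sketched as a random walk with restart over a toy gene network; the adjacency matrix and seed below are illustrative, not the paper's data.

```python
import numpy as np

def propagate(adjacency, seeds, restart=0.5, tol=1e-9, max_iter=1000):
    """Random-walk-with-restart propagation: spread seed scores (e.g., initial
    microbe-gene linkages) over a network so pathway neighbours also get credit."""
    A = np.asarray(adjacency, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)          # column-stochastic transitions
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

# Toy 4-node gene network; node 0 carries the seeded linkage score
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
scores = propagate(A, seeds=[1, 0, 0, 0])
```

After propagation the seed node keeps the largest score while its neighbours inherit part of the signal, which is how isolated linkages get contextualized within pathways.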


2021 ◽  
Vol 6 (1) ◽  
Author(s):  
Bojan Evkoski ◽  
Nikola Ljubešić ◽  
Andraž Pelicon ◽  
Igor Mozetič ◽  
Petra Kralj Novak

Twitter data exhibits several dimensions worth exploring: a network dimension in the form of links between users, the textual content of the tweets posted, and a temporal dimension in the time-stamped sequence of tweets and their retweets. In this paper, we combine analyses along all three dimensions: the temporal evolution of retweet networks and communities, content in terms of hate speech, and discussion topics. We apply the methods to a comprehensive set of all Slovenian tweets collected in the years 2018–2020. We find that politics and ideology are the prevailing topics despite the emergence of the Covid-19 pandemic. These two topics also attract the highest proportion of unacceptable tweets. Over time, the membership of retweet communities changes, but their topic distribution remains remarkably stable. Some retweet communities are strongly linked by external retweet influence and form super-communities. Super-community membership closely corresponds to topic distribution: communities from the same super-community have very similar topic distributions, while communities from different super-communities discuss quite different topics. However, we also find that even communities from the same super-community differ considerably in the proportion of unacceptable tweets they post.
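Topic-distribution similarity between communities, of the kind underlying the super-community observation, can be measured with the Jensen-Shannon divergence; the community vectors below are hypothetical.

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Symmetric divergence between two topic distributions (0 = identical)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic shares (politics, covid, sport) for three retweet communities
comm_a = [0.70, 0.20, 0.10]   # same super-community as comm_b
comm_b = [0.65, 0.25, 0.10]
comm_c = [0.10, 0.20, 0.70]   # different super-community

within = jensen_shannon(comm_a, comm_b)   # small: similar topic mix
across = jensen_shannon(comm_a, comm_c)   # large: different topic mix
```

A low within-super-community divergence and a high across-super-community divergence is exactly the pattern the abstract describes.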


2021 ◽  
Vol 8 (6) ◽  
pp. 1265
Author(s):  
Muhammad Alkaff ◽  
Andreyan Rizky Baskara ◽  
Irham Maulani

A service system named Lapor! conveys aspirations and complaints from the public about Indonesian government services. The government has long used this system to address problems of the Indonesian people related to bureaucracy. However, the increasing volume of reports, which operators sort by reading every complaint that comes through the system, causes frequent errors in which operators forward reports to the wrong agencies. Therefore, we need a solution that can automatically determine a report's context using Natural Language Processing techniques. This study aims to build an automatic report classifier, based on report topics, that routes reports to the authorized agencies by combining Latent Dirichlet Allocation (LDA) and a Support Vector Machine (SVM). Topic modeling for each report is carried out using the LDA method, which extracts reports to find specific patterns in documents and outputs topic distribution values. The classification step that determines a report's destination agency is then carried out using the SVM, based on the topic values extracted by the LDA method. The LDA-SVM model's performance is measured using a confusion matrix, calculating accuracy, precision, recall, and F1 score. Test results using a 70:30 train-test split show that the model performs well, with 79.85% accuracy, 79.98% precision, 72.37% recall, and a 74.67% F1 score.
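The confusion-matrix evaluation can be reproduced with a small self-contained helper (a sketch with toy labels, not the study's data) that computes accuracy and macro-averaged precision, recall, and F1.

```python
def classification_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall, and F1 from a confusion
    matrix, as used to evaluate a report classifier like LDA-SVM."""
    cm = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    accuracy = sum(cm[l][l] for l in labels) / len(y_true)
    precisions, recalls = [], []
    for l in labels:
        tp = cm[l][l]
        fp = sum(cm[t][l] for t in labels if t != l)   # predicted l, actually other
        fn = sum(cm[l][p] for p in labels if p != l)   # actually l, predicted other
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy routing example: two hypothetical destination agencies
acc, prec, rec, f1 = classification_metrics(
    ["health", "health", "roads", "roads"],
    ["health", "roads", "roads", "roads"],
    ["health", "roads"])
```

Note that recall can lag accuracy and precision when one class is under-predicted, which matches the gap between the reported 79.85% accuracy and 72.37% recall.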


2021 ◽  
Vol 32 (3) ◽  
pp. 46-68
Author(s):  
Fei Liu ◽  
Meiyun Zuo

The COVID-19 pandemic is an ongoing global pandemic that has caused worldwide social and economic disruption. In addition to physical illness, people must endure the psychological intrusion of rumors. It is therefore critical to characterize the accompanying infodemic, a significant part of COVID-19, to help defeat the epidemic. This article mines the topic distribution and evolution patterns of online rumors by comparing and contrasting COVID-19 rumors from the two most popular rumor-refuting platforms, Jiaozhen in China and Full Fact in the United Kingdom (UK), via a novel topic mining model: text clustering based on Bidirectional Encoder Representations from Transformers (BERT), combined with lifecycle theory. This spatio-temporal comparison enriches infodemiology research and provides practical guidance for governments, rumor-refuting platforms, and individuals. The comparative study highlights the similarities and differences of online rumors about global public health emergencies across countries.


2021 ◽  
Vol 13 (4) ◽  
pp. 40-56
Author(s):  
Jiaohua Qin ◽  
Zhuo Zhou ◽  
Yun Tan ◽  
Xuyu Xiang ◽  
Zhibin He

Coverless information hiding has become a hot topic in recent years. Existing steganalysis tools are ineffective against coverless steganography because it makes no modification to the carrier. However, coverless text steganography has a relatively low hiding capacity, so this paper proposes a big-data coverless text information hiding method based on LDA (Latent Dirichlet Allocation) topic distribution and keyword TF-IDF (term frequency-inverse document frequency). First, the sender and receiver build a shared codebook covering word segmentation, word frequency and TF-IDF features, and LDA topic-model clustering. The sender then splits the secret information, converts it into keyword IDs through the keyword-index table, and searches for texts containing the secret-information keywords. Second, each retrieved text is assigned an index tag according to its topic distribution and TF-IDF features. At the same time, random numbers are introduced to control the keyword order of the secret information.
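The codebook-and-search step can be sketched in plain Python; the keyword-index table and corpus below are invented for illustration, and a real system would derive them from a large shared corpus.

```python
# Secret message is split into keywords
secret = "launch at dawn"
keywords = secret.split()

# Sender and receiver build the same keyword-index table from a shared corpus
# (IDs here are arbitrary placeholders)
keyword_index = {"launch": 17, "at": 3, "dawn": 42}

# The secret becomes a sequence of keyword IDs, never embedded in any carrier
ids = [keyword_index[w] for w in keywords]

# Sender searches a big-data text corpus for natural texts that already
# contain each keyword, so no carrier is ever modified
corpus = [
    "the rocket launch was delayed",
    "we met at the old cafe",
    "mist settles before dawn on the river",
]

def find_carrier(keyword, corpus):
    """Return the first corpus text that already contains the keyword as-is."""
    for text in corpus:
        if keyword in text.split():
            return text
    return None

carriers = [find_carrier(w, corpus) for w in keywords]
```

In the full method, the chosen texts are additionally indexed by their LDA topic distribution and TF-IDF features, and random numbers scramble the keyword order; this sketch only shows the unmodified-carrier lookup that defeats modification-based steganalysis.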


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Xiujuan Wang ◽  
Yi Sui ◽  
Yuanrui Tao ◽  
Qianqian Zhang ◽  
Jianhua Wei

With the rapid development of the Internet since the beginning of the 21st century, social networks have provided significant convenience for work, study, and entertainment. Because of the irreplaceable role of social platforms in disseminating information, criminals have updated the main methods of social engineering attacks accordingly. Detecting abnormal accounts on social networks in a timely manner can effectively prevent malicious Internet events. Unlike previous research, this work proposes an anomaly detection method called Hurst of Interest Distribution, based on the stability of user interest as quantified from the content of users' tweets, to detect abnormal accounts. In detail, the Latent Dirichlet Allocation model is adopted to classify Twitter posts into topics and obtain the topic distribution of tweets sent by a single user within a period of time. The stability of the user's tweet topic preference is then measured with the Hurst index to determine whether the account is compromised. Experiments show that the Hurst indexes of normal and abnormal accounts differ significantly, and the detection rate of abnormal accounts using the proposed method reaches up to 97.93%.
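A simple rescaled-range (R/S) estimate of the Hurst index, of the kind used to score topic-preference stability, can be sketched as follows; this is a generic implementation, not the authors' exact procedure.

```python
import numpy as np

def hurst_rs(series):
    """Estimate the Hurst exponent with rescaled-range (R/S) analysis.
    Values near 0.5 suggest a random walk; higher values suggest persistence."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    sizes = [s for s in (8, 16, 32, 64, 128) if s <= n // 2]
    log_size, log_rs = [], []
    for size in sizes:
        rs_vals = []
        for start in range(0, n - size + 1, size):
            window = series[start:start + size]
            dev = np.cumsum(window - window.mean())   # cumulative deviations
            r = dev.max() - dev.min()                 # range
            s = window.std()                          # scale
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            log_size.append(np.log(size))
            log_rs.append(np.log(np.mean(rs_vals)))
    slope, _ = np.polyfit(log_size, log_rs, 1)        # H is the log-log slope
    return slope

rng = np.random.default_rng(0)
noise_h = hurst_rs(rng.standard_normal(512))              # white noise: H near 0.5
walk_h = hurst_rs(np.cumsum(rng.standard_normal(512)))    # persistent walk: higher H
```

Applied to a per-user time series of topic-preference values, a stable (persistent) series yields a higher estimate than an erratic one, which is the signal used to separate normal from compromised accounts.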

