Federated Latent Dirichlet Allocation: A Local Differential Privacy Based Framework

Yansheng Wang; Yongxin Tong; Dingyuan Shi

doi:10.1609/aaai.v34i04.6096

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6096 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6283-6290 ◽

Cited By ~ 2

Author(s):

Yansheng Wang ◽

Yongxin Tong ◽

Dingyuan Shi

Keyword(s):

Data Privacy ◽

Latent Dirichlet Allocation ◽

Differential Privacy ◽

Topic Model ◽

Text Data ◽

Data Collector ◽

Industrial Grade ◽

Model Training ◽

Open Datasets ◽

Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a widely adopted topic model for industrial-grade text mining applications. However, its performance relies heavily on collecting large amounts of text data from users' everyday lives for model training. Such data collection risks severe privacy leakage if the data collector is untrustworthy. To protect text data privacy while allowing accurate model training, we investigate federated learning of LDA models. That is, the model is collaboratively trained between an untrustworthy data collector and multiple users, where the raw text data of each user are stored locally and not uploaded to the data collector. To this end, we propose FedLDA, a local differential privacy (LDP) based framework for federated learning of LDA models. Central to FedLDA is a novel LDP mechanism called Random Response with Priori (RRP), which provides theoretical guarantees on both data privacy and model accuracy. We also design techniques to reduce the communication cost between the data collector and the users during model training. Extensive experiments on three open datasets verified the effectiveness of our solution.
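
To make the idea of local differential privacy concrete, the sketch below implements generic k-ary randomized response in Python. It is not the RRP mechanism from the abstract, whose details are not given here; it is the textbook baseline that such mechanisms build on. The function names, the toy data, and the choice of epsilon = 1.0 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def krr_probabilities(epsilon, k):
    """Return (p, q): probability of reporting the true value, and of each other value."""
    e = math.exp(epsilon)
    p = e / (e + k - 1)
    q = 1.0 / (e + k - 1)
    return p, q


def krr_perturb(value, k, epsilon, rng):
    """Perturb one value in range(k) on the user's side before it is uploaded."""
    p, _ = krr_probabilities(epsilon, k)
    if rng.random() < p:
        return value
    # Otherwise report one of the other k - 1 values uniformly at random.
    other = rng.randrange(k - 1)
    return other if other < value else other + 1


def krr_estimate_counts(reports, k, epsilon):
    """Collector side: unbiased estimate of each value's true count."""
    p, q = krr_probabilities(epsilon, k)
    n = len(reports)
    observed = Counter(reports)
    # E[observed[v]] = true[v] * p + (n - true[v]) * q, solved for true[v].
    return [(observed.get(v, 0) - n * q) / (p - q) for v in range(k)]


# Toy data (hypothetical): 1000 users, each holding one of 3 values.
rng = random.Random(0)
true_values = [0] * 600 + [1] * 300 + [2] * 100
reports = [krr_perturb(v, 3, 1.0, rng) for v in true_values]
estimates = krr_estimate_counts(reports, 3, 1.0)
```

Because p / q = exp(epsilon), any single report changes the collector's likelihood ratio between two candidate true values by at most a factor of exp(epsilon), which is the LDP guarantee. The estimates are noisy for any one value but always sum to the number of reports.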

ldagibbs: A Command for Topic Modeling in Stata Using Latent Dirichlet Allocation

The Stata Journal: Promoting communications on statistics and Stata ◽

10.1177/1536867x1801800107 ◽

2018 ◽

Vol 18 (1) ◽

pp. 101-117 ◽

Cited By ~ 10

Author(s):

Carlo Schwarz

Keyword(s):

Machine Learning ◽

Probability Distribution ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Topic Models ◽

Text Documents ◽

Text Data ◽

Dirichlet Allocation

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
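
The two-level representation described above (documents as distributions over topics, topics as distributions over words) can be illustrated with a small Python sketch of the standard LDA generative process. This is not the ldagibbs command or its Stata implementation; the function names, the vocabulary size of 6, the 2 topics, and the Dirichlet parameter 0.5 are assumptions chosen only for the example.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, doc_topic, length, rng):
    """Generate word ids: pick a topic from the document's mixture, then a word from that topic."""
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    words = []
    for _ in range(length):
        topic = rng.choices(range(num_topics), weights=doc_topic)[0]
        word = rng.choices(range(vocab_size), weights=topic_word[topic])[0]
        words.append(word)
    return words


rng = random.Random(0)
# Each topic is a probability distribution over a 6-word vocabulary.
topic_word = [sample_dirichlet(0.5, 6, rng) for _ in range(2)]
# The document is a probability distribution over the 2 topics.
doc_topic = sample_dirichlet(0.5, 2, rng)
document = generate_document(topic_word, doc_topic, 20, rng)
```

Fitting LDA runs this process in reverse: given only the word ids in each document, it infers plausible values for `topic_word` and `doc_topic`.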

Latent Dirichlet Allocation Model Training With Differential Privacy

IEEE Transactions on Information Forensics and Security ◽

10.1109/tifs.2020.3032021 ◽

2021 ◽

Vol 16 ◽

pp. 1290-1305

Author(s):

Fangyuan Zhao ◽

Xuebin Ren ◽

Shusen Yang ◽

Qing Han ◽

Peng Zhao ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Differential Privacy ◽

Allocation Model ◽

Latent Dirichlet Allocation Model ◽

Model Training ◽

Dirichlet Allocation

On Privacy Protection of Latent Dirichlet Allocation Model Training

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/675 ◽

2019 ◽

Cited By ~ 1

Author(s):

Fangyuan Zhao ◽

Xuebin Ren ◽

Shusen Yang ◽

Xinyu Yang

Keyword(s):

Machine Learning ◽

Latent Dirichlet Allocation ◽

Differential Privacy ◽

Machine Learning Algorithms ◽

Sensitive Information ◽

Training Algorithm ◽

Allocation Model ◽

Model Training ◽

Real World Datasets ◽

Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering the hidden semantic architecture of text datasets, and plays a fundamental role in many machine learning applications. However, like many other machine learning algorithms, the process of training an LDA model may leak sensitive information from the training datasets and bring significant privacy risks. To mitigate the privacy issues in LDA, we focus on studying privacy-preserving algorithms for LDA model training in this paper. In particular, we first develop a privacy monitoring algorithm to investigate the privacy guarantee obtained from the inherent randomness of the Collapsed Gibbs Sampling (CGS) process in a typical LDA training algorithm on centralized curated datasets. Then, we further propose a locally private LDA training algorithm on crowdsourced data to provide local differential privacy for individual data contributors. The experimental results on real-world datasets demonstrate the effectiveness of our proposed algorithms.
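
For readers unfamiliar with Collapsed Gibbs Sampling, the following Python sketch shows the standard, non-private CGS update for LDA. It does not include the privacy monitoring or the locally private algorithm proposed in the paper; the function name, the toy corpus, and the hyperparameter values are assumptions for illustration only.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha, beta, sweeps, rng):
    """Run plain collapsed Gibbs sampling; return (doc-topic counts, topic-word counts)."""
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic and record it in the counts.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assign.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(doc_assign)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # The random draw here is the "inherent randomness" of CGS.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy corpus (hypothetical): word ids in a 5-word vocabulary.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4], [0, 1, 3, 4]]
doc_topic_counts, topic_word_counts = collapsed_gibbs_lda(
    docs, num_topics=2, vocab_size=5, alpha=0.1, beta=0.01, sweeps=50,
    rng=random.Random(0),
)
```

The sampler only ever touches count tables, so the information a trained model exposes about the corpus is carried by `topic_word_counts`.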

Improving Privacy Guarantee and Efficiency of Latent Dirichlet Allocation Model Training Under Differential Privacy

10.18653/v1/2021.findings-emnlp.14 ◽

2021 ◽

Author(s):

Tao Huang ◽

Hong Chen

Keyword(s):

Latent Dirichlet Allocation ◽

Differential Privacy ◽

Allocation Model ◽

Latent Dirichlet Allocation Model ◽

Model Training ◽

Dirichlet Allocation

Intelligent radar software defect classification approach based on the latent Dirichlet allocation topic model

EURASIP Journal on Advances in Signal Processing ◽

10.1186/s13634-021-00761-3 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Xi Liu ◽

Yongfeng Yin ◽

Haifeng Li ◽

Jiabin Chen ◽

Chang Liu ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Recall Rate ◽

Defect Classification ◽

Software Defects ◽

Classification Approach ◽

Software Defect ◽

Model Combining ◽

Dirichlet Allocation

Existing intelligent software defect classification approaches do not consider radar characteristics and prior statistical information. Thus, when these approaches are applied to radar software testing and validation, the precision and recall rates of defect classification are poor, which reduces the effectiveness of reusing software defects. To solve this problem, a new intelligent defect classification approach based on the latent Dirichlet allocation (LDA) topic model is proposed for radar software in this paper. The proposed approach includes the defect text segmentation algorithm based on the dictionary of the radar domain, the modified LDA model combining radar software requirements, and the top acquisition and classification approach of radar software defects based on the modified LDA model. The proposed approach is applied to typical radar software defects to validate its effectiveness and applicability. The application results illustrate that the prediction precision rate and recall rate of the proposed approach are improved by up to 15 to 20% compared with the other defect classification approaches. Thus, the proposed approach can be applied effectively in the segmentation and classification of radar software defects to improve the adequacy of defect identification in radar software.

Research progress and trend of leader member exchange based on social complex network and latent Dirichlet allocation topic model

2020 2nd International Conference on Economic Management and Model Engineering (ICEMME) ◽

10.1109/icemme51517.2020.00090 ◽

2020 ◽

Author(s):

Zhang Chunyang ◽

Ding Kun ◽

Zhang Chunbo ◽

Zhang Li

Keyword(s):

Complex Network ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Research Progress ◽

Leader Member Exchange ◽

Member Exchange ◽

Dirichlet Allocation

Augmented Latent Dirichlet Allocation (LDA) Topic Model with Gaussian Mixture Topics

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2018.8462003 ◽

2018 ◽

Cited By ~ 1

Author(s):

Kedar S. Prabhudesai ◽

Boyla O. Mainsah ◽

Leslie M. Collins ◽

Chandra S. Throckmorton

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Gaussian Mixture ◽

Dirichlet Allocation

An exploration of text mining of narrative reports of injury incidents to assess risk

MATEC Web of Conferences ◽

10.1051/matecconf/201825106020 ◽

2018 ◽

Vol 251 ◽

pp. 06020 ◽

Cited By ~ 4

Author(s):

David Passmore ◽

Chungil Chae ◽

Yulia Kustikova ◽

Rose Baker ◽

Jeong-Ha Yim

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Surface Mining ◽

Modeling Processes ◽

Free Text ◽

Text Data ◽

Injury Occurrence ◽

The USA ◽

Musculoskeletal Systems ◽

Topic Mining

A topic model was explored using unsupervised machine learning to summarize free-text narrative reports of 77,215 injuries that occurred in coal mines in the USA between 2000 and 2015. Latent Dirichlet Allocation modeling processes identified six topics from the free-text data. One topic, a theme describing primarily injury incidents resulting in strains and sprains of musculoskeletal systems, revealed differences in topic emphasis by the location of the mine property at which injuries occurred, the degree of injury, and the year of injury occurrence. Text narratives clustered around this topic refer most frequently to surface or other locations rather than underground locations that resulted in disability and that, also, increased secularly over time. The modeling success enjoyed in this exploratory effort suggests that additional topic mining of these injury text narratives is justified, especially using a broad set of covariates to explain variations in topic emphasis and for comparison of surface mining injuries with injuries occurring during site preparation for construction.

A Latent Dirichlet Allocation and Fuzzy Clustering Based Machine Learning Model for Text Thesaurus

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2020.2.3811 ◽

2020 ◽

Vol 15 (2) ◽

Author(s):

Jia Luo ◽

Dongwen Yu ◽

Zong Dai

Keyword(s):

Machine Learning ◽

Fuzzy Clustering ◽

Latent Dirichlet Allocation ◽

Learning Model ◽

Machine Learning Algorithms ◽

Text Data ◽

Huge Data ◽

Machine Learning Model ◽

N Gram ◽

Dirichlet Allocation

Manual methods cannot practically process huge amounts of structured and semi-structured data. This study aims to solve the problem of processing huge data through machine learning algorithms. We collected text data on the company's public opinion through crawlers, used the Latent Dirichlet Allocation (LDA) algorithm to extract the keywords of the text, and used fuzzy clustering to cluster the keywords into different topics. The topic keywords are then used as a seed dictionary for new word discovery. To verify the efficiency of machine learning in new word discovery, algorithms based on association rules, N-Gram, PMI, and Word2vec were used for comparative testing of new word discovery. The experimental results show that the Word2vec algorithm, which is based on a machine learning model, achieves the highest accuracy, recall, and F-value.

The surveillance of a supreme audit institution on related party transactions

Journal of Public Budgeting Accounting & Financial Management ◽

10.1108/jpbafm-12-2019-0181 ◽

2020 ◽

Vol 32 (4) ◽

pp. 577-603

Author(s):

Gustavo Cesário ◽

Ricardo Lopes Cardoso ◽

Renato Santos Aranha

Keyword(s):

Public Sector ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Conflicts Of Interest ◽

International Standards ◽

Content Type ◽

Related Party Transactions ◽

The Public ◽

Audit Reports ◽

Dirichlet Allocation

Purpose: This paper aims to analyse how the supreme audit institution (SAI) monitors related party transactions (RPTs) in the Brazilian public sector. It considers definitions and disclosure policies of RPTs by international accounting and auditing standards and their evolution since 1980. Design/methodology/approach: Based on archival research on international standards and using an interpretive approach, the authors investigated definitions and disclosure policies. Using a topic model based on latent Dirichlet allocation, the authors performed a content analysis on over 59,000 SAI decisions to assess how the SAI monitors RPTs. Findings: The SAI investigates nepotism (a kind of RPT) and conflicts of interest up to eight times more frequently than related parties. Brazilian laws prevent nepotism and conflicts of interest, but not RPTs in general. Indeed, Brazilian public-sector accounting standards have not converged towards IPSAS 20, and ISSAI 1550 does not adjust auditing procedures to suit the public sector. Research limitations/implications: The SAI follows a legalistic auditing approach, indicating a need for regulation of related public-sector parties to improve surveillance. In addition to Brazil, other code law countries might face similar circumstances. Originality/value: Public-sector RPTs are an under-investigated field, calling for attention by academics and standard-setters. Text mining and latent Dirichlet allocation, while mature techniques, are underexplored in accounting and auditing studies. Additionally, the Python script created to analyse the audit reports is available at Mendeley Data and may be used to perform similar analyses with minor adaptations.
