Evaluating individual genome similarity with a topic model

Abstract Motivation Evaluating genome similarity among individuals is an essential step in data analysis. Advanced sequencing technology detects more and rarer variants for massive individual genomes, thus enabling individual-level genome similarity evaluation. However, the current methodologies, such as the principal component analysis (PCA), lack the capability to fully leverage rare variants and are also difficult to interpret in terms of population genetics. Results Here, we introduce a probabilistic topic model, latent Dirichlet allocation, to evaluate individual genome similarity. A total of 2535 individuals from the 1000 Genomes Project (KGP) were used to demonstrate our method. Various aspects of variant choice and model parameter selection were studied. We found that relatively rare (0.001<allele frequency < 0.175) and sparse (average interval > 20 000 bp) variants are more efficient for genome similarity evaluation. At least 100 000 such variants are necessary. In our results, the populations show significantly less mixed and more cohesive visualization than the PCA results. The global similarities among the KGP genomes are consistent with known geographical, historical and cultural factors. Availability and implementation The source code and data access are available at: https://github.com/lrjuan/LDA_genome. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Identifying User Interests In An Online Discussion Forum With Deep Learning

10.32920/ryerson.14654349.v1 ◽

2021 ◽

Author(s):

Nicholas Buhagiar ◽

Bahram Zahir ◽

Abdolreza Abhari

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Model Framework ◽

User Interests ◽

Online Discussion Forum ◽

Probabilistic Topic Model ◽

Average Accuracy ◽

Discussion Threads ◽

Validation Set ◽

Evaluation Metric

The probabilistic topic model Latent Dirichlet Allocation (LDA) was deployed to model the themes of discourse in discussion threads on the social media aggregation website Reddit. Abstracting discussion threads as vectors of topic weights, these vectors were fed into several neural network architectures, each with a different number of hidden layers, to train machine learning models that could identify which discussion would be of interest for a given user to contribute. Using accuracy as the evaluation metric to determine which model framework achieved the best performance on a given user’s validation set, these selected models achieved an average accuracy of 66.1% on the test data for a sample set of 30 users. Using the predicted probabilities of interest made by these neural networks, recommender systems were further built and analyzed for each user.

Download Full-text

Can topic models be used in research evaluations? Reproducibility, validity, and reliability when compared with semantic maps

Research Evaluation ◽

10.1093/reseval/rvz015 ◽

2019 ◽

Vol 28 (3) ◽

pp. 263-272 ◽

Cited By ~ 1

Author(s):

Tobias Hecking ◽

Loet Leydesdorff

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Research Evaluation ◽

Topic Models ◽

Principal Component ◽

Evaluation Framework ◽

Validity And Reliability ◽

Text Corpora ◽

Semantic Maps ◽

Semantic Coherence

AbstractWe replicate and analyze the topic model which was commissioned to King’s College and Digital Science for the Research Evaluation Framework (REF 2014) in the United Kingdom: 6,638 case descriptions of societal impact were submitted by 154 higher-education institutes. We compare the Latent Dirichlet Allocation (LDA) model with Principal Component Analysis (PCA) of document-term matrices using the same data. Since topic models are almost by definition applied to text corpora which are too large to read, validation of the results of these models is hardly possible; furthermore the models are irreproducible for a number of reasons. However, removing a small fraction of the documents from the sample—a test for reliability—has on average a larger impact in terms of decay on LDA than on PCA-based models. The semantic coherence of LDA models outperforms PCA-based models. In our opinion, results of the topic models are statistical and should not be used for grant selections and micro decision-making about research without follow-up using domain-specific semantic maps.

Download Full-text

Identifying User Interests In An Online Discussion Forum With Deep Learning

10.32920/ryerson.14654349 ◽

2021 ◽

Author(s):

Nicholas Buhagiar ◽

Bahram Zahir ◽

Abdolreza Abhari

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Model Framework ◽

User Interests ◽

Online Discussion Forum ◽

Probabilistic Topic Model ◽

Average Accuracy ◽

Discussion Threads ◽

Validation Set ◽

Evaluation Metric

Download Full-text

Understanding the University-Sustainability Link through Media: A Spanish Perspective

Sustainability ◽

10.3390/su12124830 ◽

2020 ◽

Vol 12 (12) ◽

pp. 4830 ◽

Cited By ~ 1

Author(s):

Cecilia Elizabeth Bayas Aldaz ◽

Jesus Rodriguez-Pomeda ◽

Leyla Angélica Sandoval Hamón ◽

Fernando Casani

Keyword(s):

Higher Education ◽

Social Perception ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Higher Education Institutions ◽

News Coverage ◽

News Sources ◽

University Funding ◽

Probabilistic Topic Model ◽

The Social

This article provides a procedure to universities for understanding the social perception of their activities in the sustainability field, through the analysis of news published in the printed media. It identifies the Spanish news sources that have covered this issue the most and the topics that appear in that news coverage. Using a probabilistic topic model called Latent Dirichlet Allocation, the study includes the nine dominant topics within a corpus with more than seventeen thousand published news items (totaling approximately five and a quarter million words) from a database of almost thirteen hundred national press sources between 2014 and 2017. The study identifies the news sources that published the most news on the issue. It is also found that the amount of news on sustainability and universities declined during the covered period. The nine identified topics point towards the relevance of higher education institutions’ activities as drivers of sustainability. The social perception encapsulated within the topics signals how the public is interested in these activities. Therefore, we find some interesting relationships between sustainable development, higher education institutions’ missions and behaviors, governmental policies, university funding and governance, social and economic innovation, and green campuses in terms of the overall goal of sustainability.

Download Full-text

A Text-Mining Analysis on the Review of the Non-Financial Reporting Directive: Bringing Value Creation for Stakeholders into Accounting

Sustainability ◽

10.3390/su13020763 ◽

2021 ◽

Vol 13 (2) ◽

pp. 763

Author(s):

Simona Fiandrino ◽

Alberto Tonelli

Keyword(s):

Text Mining ◽

Financial Reporting ◽

Value Creation ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Current Debate ◽

The Core ◽

Probabilistic Topic Model ◽

Integrated Logic

The recent Review of the Non-Financial Reporting Directive (NFRD) aims to enhance adequate non-financial information (NFI) disclosure and improve accountability for stakeholders. This study focuses on this regulatory intervention and has a twofold objective: First, it aims to understand the main underlying issues at stake; second, it suggests areas of possible amendment considering the current debates on sustainability accounting and accounting for stakeholders. In keeping with these aims, the research analyzes the documents annexed to the contribution on the Review of the NFRD by conducting a text-mining analysis with latent Dirichlet allocation (LDA) probabilistic topic model (PTM). Our findings highlight four main topics at the core of the current debate: quality of NFI, standardization, materiality, and assurance. The research suggests ways of improving managerial policies to achieve more comparable, relevant, and reliable information by bringing value creation for stakeholders into accounting. It further addresses an integrated logic of accounting for stakeholders that contributes to sustainable development.

Download Full-text

Detection of Cases of Noncompliance to Drug Treatment in Patient Forum Posts: Topic Model Approach (Preprint)

10.2196/preprints.9222 ◽

2017 ◽

Author(s):

Redhouane Abdellaoui ◽

Pierre Foulquié ◽

Nathalie Texier ◽

Carole Faviez ◽

Anita Burgun ◽

...

Keyword(s):

Social Media ◽

Virtual Communities ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Antidepressant Drug ◽

Topic Models ◽

Web Crawler ◽

Probabilistic Topic Model ◽

Manual Review ◽

Model Approach

BACKGROUND Medication nonadherence is a major impediment to the management of many health conditions. A better understanding of the factors underlying noncompliance to treatment may help health professionals to address it. Patients use peer-to-peer virtual communities and social media to share their experiences regarding their treatments and diseases. Using topic models makes it possible to model themes present in a collection of posts, thus to identify cases of noncompliance. OBJECTIVE The aim of this study was to detect messages describing patients’ noncompliant behaviors associated with a drug of interest. Thus, the objective was the clustering of posts featuring a homogeneous vocabulary related to nonadherent attitudes. METHODS We focused on escitalopram and aripiprazole used to treat depression and psychotic conditions, respectively. We implemented a probabilistic topic model to identify the topics that occurred in a corpus of messages mentioning these drugs, posted from 2004 to 2013 on three of the most popular French forums. Data were collected using a Web crawler designed by Kappa Santé as part of the Detec’t project to analyze social media for drug safety. Several topics were related to noncompliance to treatment. RESULTS Starting from a corpus of 3650 posts related to an antidepressant drug (escitalopram) and 2164 posts related to an antipsychotic drug (aripiprazole), the use of latent Dirichlet allocation allowed us to model several themes, including interruptions of treatment and changes in dosage. The topic model approach detected cases of noncompliance behaviors with a recall of 98.5% (272/276) and a precision of 32.6% (272/844). CONCLUSIONS Topic models enabled us to explore patients’ discussions on community websites and to identify posts related with noncompliant behaviors. After a manual review of the messages in the noncompliance topics, we found that noncompliance to treatment was present in 6.17% (276/4469) of the posts.

Download Full-text

Integrating Topic, Sentiment, and Syntax for Modeling Online Reviews: A Topic Model Approach

Journal of Computing and Information Science in Engineering ◽

10.1115/1.4041475 ◽

2018 ◽

Vol 19 (1) ◽

Cited By ~ 4

Author(s):

Min Tang ◽

Jian Jin ◽

Ying Liu ◽

Chunping Li ◽

Weiwen Zhang

Keyword(s):

Visual Information ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Online Reviews ◽

Sentiment Classification ◽

Academic Field ◽

Aspect Extraction ◽

Part Of Speech ◽

Probabilistic Topic Model ◽

Topic Sentiment

Analyzing product online reviews has drawn much interest in the academic field. In this research, a new probabilistic topic model, called tag sentiment aspect models (TSA), is proposed on the basis of Latent Dirichlet allocation (LDA), which aims to reveal latent aspects and corresponding sentiment in a review simultaneously. Unlike other topic models which consider words in online reviews only, syntax tags are taken as visual information and, in this research, as a kind of widely used syntax information, part-of-speech (POS) tags are first reckoned. Specifically, POS tags are integrated into three versions of implementation in consideration of the fact that words with different POS tags might be utilized to express consumers' opinions. Also, the proposed TSA is one unsupervised approach and only a small number of positive and negative words are required to confine different priors for training. Finally, two big datasets regarding digital SLR and laptop are utilized to evaluate the performance of the proposed model in terms of sentiment classification and aspect extraction. Comparative experiments show that the new model can not only achieve promising results on sentiment classification but also leverage the performance on aspect extraction.

Download Full-text

SPUCL (Scientific Publication Classifier): A Human-Readable Labelling System for Scientific Publications

Applied Sciences ◽

10.3390/app11199154 ◽

2021 ◽

Vol 11 (19) ◽

pp. 9154

Author(s):

Noemi Scarpato ◽

Alessandra Pieroni ◽

Michela Montorsi

Keyword(s):

Latent Dirichlet Allocation ◽

Scientific Literature ◽

Topic Model ◽

Scientific Publication ◽

Research Field ◽

Classifier Systems ◽

Scientific Publications ◽

Probabilistic Topic Model ◽

Probabilistic Machine Learning ◽

Labelling System

To assess critically the scientific literature is a very challenging task; in general it requires analysing a lot of documents to define the state-of-the-art of a research field and classifying them. The documents classifier systems have tried to address this problem by different techniques such as probabilistic, machine learning and neural networks models. One of the most popular document classification approaches is the LDA (Latent Dirichlet Allocation), a probabilistic topic model. One of the main issues of the LDA approach is that the retrieved topics are a collection of terms with their probabilities and it does not have a human-readable form. This paper defines an approach to make LDA topics comprehensible for humans by the exploitation of the Word2Vec approach.

Download Full-text

C-factor: a summary measure for systemic arterial calcifications

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-02126-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Lieke M. Kuiper ◽

M. Kamran Ikram ◽

Maryam Kavousi ◽

Meike W. Vernooij ◽

M. Arfan Ikram ◽

...

Keyword(s):

Cox Regression ◽

Intraclass Correlation ◽

Principal Component ◽

Arterial Calcification ◽

Standard Deviation Increase ◽

Single Measure ◽

Individual Level ◽

Enhanced Ct ◽

Study Participants ◽

Overall Mortality

Abstract Background Arterial calcification, the hallmark of arteriosclerosis, has a widespread distribution in the human body with only moderate correlation among sites. Hitherto, a single measure capturing the systemic burden of arterial calcification was lacking. In this paper, we propose the C-factor as an overall measure of calcification burden. Methods To quantify calcification in the coronary arteries, aortic arch, extra- and intracranial carotid arteries, and vertebrobasilar arteries, 2384 Rotterdam Study participants underwent cardiac and extra-cardiac non-enhanced CT. We performed principal component analyses on the calcification volumes of all twenty-six possible combinations of these vessel beds. Each analysis’ first principal component represents the C-factor. Subsequently, we determined the correlation between the C-factor derived from all vessel beds and the other C-factors with intraclass correlation coefficient (ICC) analyses. Finally, we examined the association of the C-factor and calcification in the separate vessel beds with cardiovascular, non-cardiovascular, and overall mortality using Cox–regression analyses. Results The ICCs ranged from 0.80 to 0.99. Larger calcification volumes and a higher C-factor were all individually associated with higher risk of cardiovascular, non-cardiovascular, and overall mortality. When included simultaneously in a model, the C-factor was still associated with all three mortality types (adjusted hazard ratio per standard deviation increase (HR) > 1.52), whereas associations of the separate vessel beds with mortality attenuated substantially (HR < 1.26). Conclusions The C-factor summarizes the systemic component of arterial calcification on an individual level and appears robust among different combinations of vessel beds. Importantly, when mutually adjusted, the C-factor retains its strength of association with mortality while the site-specific associations attenuate.

Download Full-text

Intelligent radar software defect classification approach based on the latent Dirichlet allocation topic model

EURASIP Journal on Advances in Signal Processing ◽

10.1186/s13634-021-00761-3 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Xi Liu ◽

Yongfeng Yin ◽

Haifeng Li ◽

Jiabin Chen ◽

Chang Liu ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Recall Rate ◽

Defect Classification ◽

Software Defects ◽

Classification Approach ◽

Software Defect ◽

Model Combining ◽

Dirichlet Allocation

AbstractExisting software intelligent defect classification approaches do not consider radar characters and prior statistics information. Thus, when applying these appaoraches into radar software testing and validation, the precision rate and recall rate of defect classification are poor and have effect on the reuse effectiveness of software defects. To solve this problem, a new intelligent defect classification approach based on the latent Dirichlet allocation (LDA) topic model is proposed for radar software in this paper. The proposed approach includes the defect text segmentation algorithm based on the dictionary of radar domain, the modified LDA model combining radar software requirement, and the top acquisition and classification approach of radar software defect based on the modified LDA model. The proposed approach is applied on the typical radar software defects to validate the effectiveness and applicability. The application results illustrate that the prediction precison rate and recall rate of the poposed approach are improved up to 15 ~ 20% compared with the other defect classification approaches. Thus, the proposed approach can be applied in the segmentation and classification of radar software defects effectively to improve the identifying adequacy of the defects in radar software.

Download Full-text