Continuous chromatin state feature annotation of the human epigenome

2018 ◽  
Author(s):  
Bowen Chen ◽  
Neda Shokraneh Kenari ◽  
Maxwell W Libbrecht

Abstract: Semi-automated genome annotation (SAGA) methods are widely used to understand genome activity and gene regulation. These methods take as input a set of sequencing-based assays of epigenomic activity (such as ChIP-seq measurements of histone modification and transcription factor binding) and output an annotation of the genome that assigns a chromatin state label to each genomic position. Existing SAGA methods have several limitations caused by the discrete annotation framework: such annotations cannot easily represent varying strengths of genomic elements, and they cannot easily represent combinatorial elements that simultaneously exhibit multiple types of activity. To remedy these limitations, we propose an annotation strategy that outputs a vector of chromatin state features at each position rather than a single discrete label. Continuous modeling is common in other fields, such as topic modeling of text documents. We propose a method, epigenome-ssm, that uses a Kalman filter state space model to efficiently annotate the genome with chromatin state features. We show that chromatin state features from epigenome-ssm are more useful for several downstream applications than both continuous and discrete alternatives, including in their ability to identify expressed genes and enhancers. Therefore, we expect that these continuous chromatin state features will be valuable reference annotations for visualization and downstream analysis.
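As a rough illustration of the kind of estimation a Kalman filter state space model performs, the sketch below filters a single noisy signal track into a continuous latent value at each position. It is not the epigenome-ssm implementation: the function name `kalman_filter_1d`, the random-walk transition, the noise variances, and the example signal are all assumptions made for illustration, and the method described above works with vectors of features over many assays rather than one scalar track.

```python
from typing import List, Sequence


def kalman_filter_1d(
    observations: Sequence[float],
    process_var: float = 0.01,  # assumed variance of the latent random walk
    obs_var: float = 1.0,       # assumed variance of the measurement noise
    init_mean: float = 0.0,
    init_var: float = 1.0,
) -> List[float]:
    """Return the filtered mean of a latent random-walk state at each position."""
    mean, var = init_mean, init_var
    filtered = []
    for y in observations:
        # Predict: under a random walk the mean is unchanged, uncertainty grows.
        var = var + process_var
        # Update: blend prediction and observation using the Kalman gain.
        gain = var / (var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered.append(mean)
    return filtered


# Hypothetical assay signal along eight genomic positions.
signal = [0.0, 0.2, 5.1, 4.8, 5.3, 5.0, 0.1, -0.2]
features = kalman_filter_1d(signal)
```

Each entry of `features` is a continuous value rather than a discrete label, which is the property the abstract contrasts with discrete chromatin state annotations.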


Author(s):  
Peter Ebert ◽  
Marcel H Schulz

Abstract: Motivation: The generation of genome-wide maps of histone modifications using chromatin immunoprecipitation sequencing (ChIP-seq) is a standard approach to dissect the complexity of the epigenome. Interpretation and differential analysis of histone datasets remain challenging due to regulatorily meaningful co-occurrences of histone marks and their differences in genomic spread. To ease interpretation, chromatin state segmentation maps are a commonly employed abstraction combining individual histone marks. We developed the tool SCIDDO as a fast, flexible, and statistically sound method for the differential analysis of chromatin state segmentation maps. Results: We demonstrate the utility of SCIDDO in a comparative analysis that identifies differential chromatin domains (DCDs) in various regulatory contexts and with only moderate computational resources. We show that the identified DCDs correlate well with observed changes in gene expression and can recover a substantial number of differentially expressed genes (DEGs). We showcase SCIDDO's ability to directly interrogate chromatin dynamics such as enhancer switches in downstream analysis, which simplifies exploring specific questions about regulatory changes in chromatin. By comparing SCIDDO to competing methods, we provide evidence that SCIDDO's performance in identifying DEGs via differential chromatin marking is more stable across a range of cell-type comparisons and parameter cut-offs. Availability: The SCIDDO source code is openly available at github.com/ptrebert/sciddo. Supplementary information: Supplementary data are available at Bioinformatics online.



The explosion of Web 2.0 has made social media platforms such as Facebook, Twitter, and blogs a data hub for data mining. Sentiment analysis, or opinion mining, is an automated process of understanding the opinions expressed by customers. Using data mining techniques, sentiment analysis helps determine the polarity (positive, negative, or neutral) of the views expressed by end users. Terabytes of data are now available on almost any topic, whether advertising, politics, or survey companies. CSAT (customer satisfaction) is the key factor for these survey companies. In this paper, we use topic modeling, a statistical modeling technique that categorizes text documents into topics, applying an LDA algorithm to find the topics related to social media. We used datasets of 900 records for the analysis and found three important topics in the Survey/Response dataset: Customers, Agents, and Product/Services. The results report the CSAT score according to positive, negative, and neutral responses. This approach supports better summarization of the data through topic identification and shows the polarity classification of the sentiments expressed.



2016 ◽  
Vol 4 (47) ◽  
pp. 92
Author(s):  
Sergey Nikolaevich Karpovich


Author(s):  
Федор Владимирович Краснов ◽  
Ирина Сергеевна Смазневич

As increasingly complex methods of automatic text analysis are developed, it becomes more important to explain to the user why an applied intelligent information system identifies certain texts as similar in meaning. Providing such an explanation imposes significant requirements on the intelligent algorithms used. The article covers the set of technologies involved in solving the text clustering problem and draws several conclusions. Matrix decomposition aimed at reducing the dimension of the vector representation of a corpus does not give the user a clear explanation of the algorithmic principles. Ranking with the TF-IDF function and its modifications finds only a few documents that are similar in meaning; however, this method is the easiest for users to comprehend, since algorithms of this type detect specific matching words in the compared texts. Topic modeling methods (LSI, LDA, ARTM) assign large similarity values to texts that share only a few matching words, in cases where a person can easily tell that the general subject of the texts is the same. Yet explaining how topic modeling works requires additional effort to interpret the detected topics. This interpretation gets easier as model quality grows, and the quality can be optimized using the model's average coherence. The experiment demonstrated that the absolute value of document similarity is not invariant across intelligent algorithms, so the optimal similarity threshold must be set separately for each problem to be solved. The results can be used to assess which of the various methods for detecting meaning similarity in texts can be effectively applied in applied information systems, and to determine the optimal model parameters given the requirements for explainability of the solution.
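The point about TF-IDF being easy to explain, and about thresholds being task-specific, can be illustrated with a small sketch. Everything here is an assumption for illustration rather than the authors' setup: the helper names `tfidf_vectors` and `cosine`, the smoothed IDF formula, the three toy documents, and the threshold of 0.3.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors using a smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity; only terms present in both vectors contribute."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus.
docs = [
    "topic models cluster text documents".split(),
    "topic models group text documents by theme".split(),
    "kalman filter tracks hidden state".split(),
]
vecs = tfidf_vectors(docs)

THRESHOLD = 0.3  # hypothetical value; the abstract argues it must be tuned per task
similar_pairs = [
    (i, j)
    for i in range(len(vecs))
    for j in range(i + 1, len(vecs))
    if cosine(vecs[i], vecs[j]) >= THRESHOLD
]
```

Because the score is built only from shared words, a user can be shown exactly which terms made two documents match; documents with no words in common score zero no matter how related their subjects are.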



Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
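The two-level representation described above can be made concrete with a small numeric sketch. This is not the ldagibbs command (which is a Stata command) and performs no inference; the topic-word and document-topic numbers and the helper `word_probability` are hypothetical, chosen only to show how the two kinds of distributions combine.

```python
from typing import Dict, List

# Hypothetical topics: each topic is a probability distribution over words.
topic_word: List[Dict[str, float]] = [
    {"gene": 0.5, "cell": 0.4, "market": 0.1},
    {"gene": 0.1, "cell": 0.1, "market": 0.8},
]

# Hypothetical document: a probability distribution over the two topics.
doc_topic: List[float] = [0.75, 0.25]


def word_probability(
    word: str, doc_topic: List[float], topic_word: List[Dict[str, float]]
) -> float:
    """Probability of a word in the document: topic mixture of word distributions."""
    return sum(
        theta * phi.get(word, 0.0) for theta, phi in zip(doc_topic, topic_word)
    )


vocab = sorted(topic_word[0])
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocab}
```

The resulting `doc_word` is itself a probability distribution over the vocabulary, weighted toward the words of the document's dominant topic.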



Author(s):  
Kristofferson Culmer ◽  
Jeffrey Uhlmann

The short lengths of tweets present a challenge for topic modeling to extend beyond what is provided explicitly from hashtag information. This is particularly true for LDA-based methods because the amount of information available from per-tweet statistical analysis is severely limited. In this paper we present LDA2Vec paired with temporal tweet pooling (LDA2VecTTP) and assess its performance on this problem relative to traditional LDA and to the Biterm Topic Model (Biterm), which was developed specifically for topic modeling on short text documents. We paired each of the three topic modeling algorithms with three tweet pooling schemes: no pooling, author-based pooling, and temporal pooling. We then conducted topic modeling on two Twitter datasets using each of the algorithms and the tweet pooling schemes. Our results on the largest dataset suggest that LDA2VecTTP can produce higher coherence scores and more logically coherent and interpretable topics.
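One plausible reading of temporal pooling is sketched below: tweets posted within the same fixed time window are concatenated into one longer pseudo-document before topic modeling. The abstract does not specify the pooling scheme, so the function name `temporal_pool`, the one-hour window, and the sample tweets are assumptions; the sketch also assumes `window_hours` divides 24 evenly.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple


def temporal_pool(
    tweets: List[Tuple[datetime, str]], window_hours: int = 1
) -> Dict[datetime, str]:
    """Concatenate tweets whose timestamps fall in the same fixed-width window."""
    pools: Dict[datetime, List[str]] = defaultdict(list)
    for timestamp, text in sorted(tweets):
        # Floor the timestamp to the start of its window.
        bucket_hour = (timestamp.hour // window_hours) * window_hours
        bucket = timestamp.replace(
            hour=bucket_hour, minute=0, second=0, microsecond=0
        )
        pools[bucket].append(text)
    return {bucket: " ".join(texts) for bucket, texts in pools.items()}


# Hypothetical tweets with timestamps.
tweets = [
    (datetime(2020, 1, 1, 9, 5), "traffic jam downtown"),
    (datetime(2020, 1, 1, 9, 40), "accident on main street"),
    (datetime(2020, 1, 1, 14, 15), "lunch specials today"),
]
pooled = temporal_pool(tweets)
```

Author-based pooling would follow the same pattern with the author identifier as the grouping key instead of the time window.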



Author(s):  
Junaid Rashid ◽  
Syed Muhammad Adnan Shah ◽  
Aun Irtaza

Topic modeling is an effective text mining and information retrieval approach for organizing knowledge with varied content under specific topics. Text documents in the form of news articles are increasing rapidly on the web, and analyzing these documents is important in text mining and information retrieval. Extracting meaningful information from these documents is a challenging task. One approach to discovering the themes of text documents is topic modeling, but this approach still needs a new perspective to improve its performance. In topic modeling, documents have topics and topics are collections of words. In this paper, we propose a new k-means topic modeling (KTM) approach that uses the k-means clustering algorithm. KTM discovers better semantic topics from a collection of documents. Experiments on two real-world datasets, Reuters 21578 and BBC News, show that KTM performs better than state-of-the-art topic models such as LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis). KTM is also applicable to classification and clustering tasks in text mining, where it achieves higher performance than its competitors LDA and LSA.



Author(s):  
Yiming Wang ◽  
Ximing Li ◽  
Jihong Ouyang

Neural topic modeling provides a flexible, efficient, and powerful way to extract topic representations from text documents. Unfortunately, most existing models cannot handle text data with network links, such as web pages with hyperlinks and scientific papers with citations. To handle this kind of data, we develop a novel neural topic model, namely the Layer-Assisted Neural Topic Model (LANTM), which can be interpreted from the perspective of variational auto-encoders. Our major motivation is to enhance the topic representation encoding by using not only the text contents but also the accompanying network links. Specifically, LANTM encodes the texts and network links into topic representations with an augmented network containing graph convolutional modules, and decodes them by maximizing the likelihood of the generative process. Neural variational inference is adopted for efficient inference. Experimental results validate that LANTM significantly outperforms existing models on topic quality, text classification, and link prediction.


