K-mer Motif Multinomial Mixtures, a scalable framework for multiple motif discovery

2016 ◽  
Author(s):  
Brian L. Trippe ◽  
Sandhya Prabhakaran ◽  
Harmen J. Bussemaker

Abstract Motivation The advent of inexpensive high-throughput sequencing (HTS) places new demands on motif discovery algorithms. To confront the challenges and embrace the opportunities presented by the growing wealth of information tied up in HTS datasets, we developed K-mer motif multinomial mixtures (KMMMs), a flexible class of Bayesian models for identifying multiple motifs in sequence sets using K-mer tables. Advantages of this framework are inference with time and space complexities that scale only with K, and the ability to be incorporated into larger Bayesian models. Results We derived a class of probabilistic models of K-mer tables generated from sequences containing multiple motifs. KMMMs model the K-mer table as a multinomial mixture, with motif and background components, which are distributions over K-mers overlapping with each of the latent motifs and over K-mers that do not overlap with any motif, respectively. The framework casts motif discovery as a posterior inference problem, and we present several approximate inference methods that provide accurate reconstructions of motifs in synthetic data. Finally, we apply the method to discover motifs in DNase hypersensitive sites and ChIP-seq peaks obtained from the ENCODE project.
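To make the notion of a K-mer table concrete, the following is a minimal sketch of how such a table could be built from a sequence set. The function name `kmer_table` and the toy sequences are illustrative assumptions, not part of the KMMM software; the mixture model and its inference are not shown.

```python
from collections import Counter


def kmer_table(sequences, k):
    """Count every overlapping K-mer across a set of sequences.

    Hypothetical helper for illustration only; it is not taken from
    the KMMM implementation.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Toy input: two short DNA sequences (made-up data).
table = kmer_table(["ACGTACGT", "TTACGTAA"], k=4)
```

For a DNA alphabet the table holds at most 4^K distinct entries regardless of how many sequences are counted, which is consistent with the abstract's point that inference cost depends on K rather than on dataset size.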

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xinyu Li ◽  
Wei Zhang ◽  
Jianming Zhang ◽  
Guang Li

Abstract Background Given expression data, gene regulatory network (GRN) inference approaches try to determine regulatory relations. However, current inference methods to some extent ignore the inherent topological characteristics of GRNs, leading to structures that lack a clear biological explanation. To increase the biophysical meaning of inferred networks, this study performed data-driven module detection before network inference. Gene modules were identified by decomposition-based methods. Results ICA-decomposition-based module detection methods have been used to detect functional modules directly from transcriptomic data. Experiments on time-series expression, curated, and scRNA-seq datasets suggested advantages of the proposed ModularBoost method over established methods, especially in efficiency and accuracy. For scRNA-seq datasets, the ModularBoost method outperformed other candidate inference algorithms. Conclusions As a complicated task, GRN inference can be decomposed into several tasks of reduced complexity. Using identified gene modules as topological constraints, the initial inference problem can be solved by inferring intra-modular and inter-modular interactions separately. Experimental outcomes suggest that the proposed ModularBoost method can improve the accuracy and efficiency of inference algorithms by introducing topological constraints.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Sebastian Racedo ◽  
Ivan Portnoy ◽  
Jorge I. Vélez ◽  
Homero San-Juan-Vergara ◽  
Marco Sanjuan ◽  
...  

Abstract Background High-throughput sequencing enables the analysis of the composition of numerous biological systems, such as microbial communities. The identification of dependencies within these systems requires the analysis and assimilation of the underlying interaction patterns between all the variables that make up that system. However, this task poses a challenge when considering the compositional nature of the data coming from DNA-sequencing experiments because traditional interaction metrics (e.g., correlation) produce unreliable results when analyzing relative fractions instead of absolute abundances. The compositionality-associated challenges extend to the classification task, as it usually involves the characterization of the interactions between the principal descriptive variables of the datasets. The classification of new samples/patients into binary categories corresponding to dissimilar biological settings or phenotypes (e.g., control and cases) could help researchers in the development of treatments/drugs. Results Here, we develop and exemplify a new approach, applicable to compositional data, for the classification of new samples into two groups with different biological settings. We propose a new metric to characterize and quantify the overall correlation structure deviation between these groups and a technique for dimensionality reduction to facilitate graphical representation. We conduct simulation experiments with synthetic data to assess the proposed method’s classification accuracy. Moreover, we illustrate the performance of the proposed approach using Operational Taxonomic Unit (OTU) count tables obtained through 16S rRNA gene sequencing data from two microbiota experiments. We also compare our method’s performance with that of two state-of-the-art methods. Conclusions Simulation experiments show that our method achieves a classification accuracy equal to or greater than 98% when using synthetic data. Finally, our method outperforms the other classification methods with real datasets from gene sequencing experiments.


2020 ◽  
Author(s):  
Jianhao Peng ◽  
Ullas V. Chembazhi ◽  
Sushant Bangru ◽  
Ian M. Traniello ◽  
Auinash Kalsotra ◽  
...  

Abstract Motivation With the use of single-cell RNA sequencing (scRNA-Seq) technologies, it is now possible to acquire gene expression data for each individual cell in samples containing up to millions of cells. These cells can be further grouped into different states along an inferred cell differentiation path, which are potentially characterized by similar, but distinct enough, gene regulatory networks (GRNs). Hence, it would be desirable for scRNA-Seq GRN inference methods to capture the GRN dynamics across cell states. However, current GRN inference methods produce a unique GRN per input dataset (or independent GRNs per cell state), failing to capture these regulatory dynamics. Results We propose a novel single-cell GRN inference method, named SimiC, that jointly infers the GRNs corresponding to each state. SimiC models the GRN inference problem as a LASSO optimization problem with an added similarity constraint on the GRNs associated with contiguous cell states, which captures the inter-cell-state homogeneity. We show on mouse hepatocyte single-cell data generated after partial hepatectomy that, contrary to previous GRN methods for scRNA-Seq data, SimiC is able to capture the transcription factor (TF) dynamics across liver regeneration, as well as the cell-level behavior for the regulatory program of each TF across cell states. In addition, on a honey bee scRNA-Seq experiment, SimiC is able to capture the increased heterogeneity of cells on whole-brain tissue with respect to a regional analysis tissue, and the TFs associated specifically with each sequenced tissue. Availability SimiC is written in Python and includes an R API. It can be downloaded from https://github.com/jianhao2016/[email protected], [email protected] Supplementary information Supplementary data are available at the code repository.
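As a reading aid, one plausible way to write a LASSO problem with a similarity term across contiguous cell states is sketched below. All notation is assumed for illustration and is not taken from the abstract: S ordered cell states, X_s the TF expression matrix and Y_s the target-gene expression matrix for the n_s cells in state s, W_s the weight matrix encoding the GRN of state s, and λ1, λ2 non-negative tuning parameters. The exact objective used by SimiC may differ.

```latex
\min_{W_1,\dots,W_S}\;
\sum_{s=1}^{S}\left(
  \frac{1}{2 n_s}\,\lVert Y_s - X_s W_s \rVert_F^2
  + \lambda_1 \lVert W_s \rVert_1
\right)
+ \lambda_2 \sum_{s=1}^{S-1} \lVert W_{s+1} - W_s \rVert_F^2
```

In this form, setting λ2 = 0 decouples the problem into S independent LASSO fits (one GRN per state), while a large λ2 pushes the networks of neighboring states toward one another.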


2019 ◽  
Vol 70 (15) ◽  
pp. 3867-3879 ◽  
Author(s):  
Anneke Frerichs ◽  
Julia Engelhorn ◽  
Janine Altmüller ◽  
Jose Gutierrez-Marcos ◽  
Wolfgang Werr

Abstract Fluorescence-activated cell sorting (FACS) and assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) were combined to analyse the chromatin state of lateral organ founder cells (LOFCs) in the peripheral zone of the Arabidopsis apetala1-1 cauliflower-1 double mutant inflorescence meristem. On a genome-wide level, we observed a striking correlation between transposase hypersensitive sites (THSs) detected by ATAC-seq and DNase I hypersensitive sites (DHSs). The mostly expanded DHSs were often substructured into several individual THSs, which correlated with phylogenetically conserved DNA sequences or enhancer elements. Comparing chromatin accessibility with available RNA-seq data, THS change configuration was reflected by gene activation or repression and chromatin regions acquired or lost transposase accessibility in direct correlation with gene expression levels in LOFCs. This was most pronounced immediately upstream of the transcription start, where genome-wide THSs were abundant in a complementary pattern to established H3K4me3 activation or H3K27me3 repression marks. At this resolution, the combined application of FACS/ATAC-seq is widely applicable to detect chromatin changes during cell-type specification and facilitates the detection of regulatory elements in plant promoters.


Climate ◽  
2019 ◽  
Vol 7 (1) ◽  
pp. 4 ◽  
Author(s):  
Md Masud Hasan ◽  
Barry F. W. Croke ◽  
Fazlul Karim

Probabilistic models are useful tools in understanding rainfall characteristics, generating synthetic data and predicting future events. This study describes the results from an analysis comparing the probabilistic nature of daily, monthly and seasonal rainfall totals using data from 1327 rainfall stations across Australia. The main objective of this research is to develop a relationship between parameters obtained from models fitted to daily, monthly and seasonal rainfall totals. The study also examined the possibility of estimating the parameters for daily data using parameters fitted to monthly rainfall. Three distributions within the Exponential Dispersion Model (EDM) family (Normal, Gamma and Poisson-Gamma) were found to be optimal for modelling the daily, monthly and seasonal rainfall totals. Within the EDM family, Poisson-Gamma distributions were found optimal in most cases, whereas the normal distribution was rarely optimal except for the stations from the wet region. Results showed large differences between regional and seasonal ϕ-index values (dispersion parameter), indicating the necessity of fitting separate models for each season. However, strong correlations were found between the parameters of combined data and those derived from individual seasons (0.70–0.81). This indicates the possibility of estimating parameters of individual seasons from the parameters of combined data. Such a relationship has also been noticed for the parameters obtained through monthly and daily models. Findings of this research could be useful in understanding the probabilistic features of daily, monthly and seasonal rainfall and generating daily rainfall from monthly data for rainfall stations elsewhere.


2020 ◽  
Vol 36 (9) ◽  
pp. 2905-2906 ◽  
Author(s):  
Kevin R Shieh ◽  
Christina Kratschmer ◽  
Keith E Maier ◽  
John M Greally ◽  
Matthew Levy ◽  
...  

Abstract Summary High-throughput sequencing can enhance the analysis of aptamer libraries generated by the Systematic Evolution of Ligands by EXponential enrichment. Robust analysis of the resulting sequenced rounds is best implemented by determining a ranked consensus of reads following the processing by multiple aptamer detection algorithms. While several such approaches have been developed to this end, their installation and implementation are problematic. We developed AptCompare, a cross-platform program that combines six of the most widely used analytical approaches for the identification of RNA aptamer motifs and uses a simple weighted ranking to order the candidate aptamers, all driven within the same GUI-enabled environment. We demonstrate AptCompare’s performance by identifying the top-ranked candidate aptamers from a previously published selection experiment in our laboratory, with follow-up bench assays demonstrating good correspondence between the sequences’ rankings and their binding affinities. Availability and implementation The source code and pre-built virtual machine images are freely available at https://bitbucket.org/shiehk/aptcompare. Supplementary information Supplementary data are available at Bioinformatics online.


2003 ◽  
Vol 19 (5) ◽  
pp. 607-617 ◽  
Author(s):  
K. Blekas ◽  
D. I. Fotiadis ◽  
A. Likas

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The high-throughput ChIP-seq technique, which couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging-substrings mining strategy: it finds enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. The complete detection of the real binding motifs and the processing of full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is advantageous for (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
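For readers unfamiliar with the (l, d) formulation: an instance of an (l, d) motif is conventionally a substring of length l that differs from the motif in at most d positions. The sketch below illustrates only that definition with a brute-force scan; the function names and the toy sequence are assumptions for illustration, and this is not the FCmotif algorithm, which the abstract describes as based on emerging-substring mining and clustering.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, d):
    """Return (position, substring) pairs within Hamming distance d of motif.

    Brute-force illustration of the (l, d) instance definition, where
    l = len(motif). Hypothetical helper, not part of FCmotif.
    """
    l = len(motif)
    hits = []
    for i in range(len(sequence) - l + 1):
        window = sequence[i:i + l]
        if hamming(window, motif) <= d:
            hits.append((i, window))
    return hits


# Toy example: instances of the (4, 1) motif "ACGA" in a made-up sequence.
instances = find_instances("ACGTTCGA", "ACGA", d=1)
```

A scan like this costs time proportional to sequence length times l for every candidate motif, which hints at why exhaustive approaches become slow on large ChIP-seq data sets.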


2009 ◽  
Vol 21 (7) ◽  
pp. 2049-2081 ◽  
Author(s):  
Takashi Takenouchi ◽  
Shin Ishii

In this letter, we present new methods of multiclass classification that combine multiple binary classifiers. Misclassification of each binary classifier is formulated as a bit inversion error with probabilistic models by making an analogy to the context of information transmission theory. Dependence between binary classifiers is incorporated into our model, which makes a decoder a type of Boltzmann machine. We performed experimental studies using a synthetic data set, data sets from the UCI repository, and bioinformatics data sets, and the results show that the proposed methods are superior to the existing multiclass classification methods.


2016 ◽  
Author(s):  
Caleb Kipkurui Kibet ◽  
Philip Machanick

Abstract We describe MARS (Motif Assessment and Ranking Suite), a web-based suite of tools used to evaluate and rank PWM-based motifs. Our motivation is the increased number of learned motif models that are spread across databases and in different PWM formats, which leaves users with a choice dilemma. This increase has been driven by the difficulty of modelling transcription factor binding sites and the advance in high-throughput sequencing technologies at a continually reducing cost. Therefore, several experimental techniques have been developed, resulting in diverse motif-finding algorithms and databases. We collate a wide variety of available motifs into a benchmark database, including the corresponding experimental ChIP-seq and PBM data obtained from ENCODE and UniPROBE databases, respectively. The implemented tools include a data-independent consistency-based motif assessment and ranking (CB-MAR) method, which is based on the idea that ‘correct motifs’ are more similar to each other while incorrect motifs will differ from each other, and scoring and classification-based algorithms, which rank binding models by their ability to discriminate sequences known to contain binding sites from those without. The CB-MAR and scoring techniques have a 0.86 and 0.73 median rank correlation using ChIP-seq and PBM respectively. Best motifs selected by CB-MAR achieve a mean AUC of 0.75, comparable to those ranked by held out data at 0.76. This is based on ChIP-seq motif discovery using five algorithms on 110 transcription factors. We have demonstrated the benefit of this web server in motif choice and ranking, as well as in motif discovery. It can be accessed at http://www.bioinf.ict.ru.ac.za/.
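The AUC figures quoted above measure how well a motif's scores separate sequences that contain binding sites from those that do not. As a minimal sketch of that metric, the pairwise definition of AUC can be computed as follows. The function name and the score values are made up for illustration, and this is not the MARS implementation.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sequence
    scores higher than a randomly chosen negative one, with ties
    counted as half. Hypothetical helper, not taken from MARS.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up motif scores for bound (positive) and unbound (negative) sequences.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

An AUC of 0.5 corresponds to a motif that discriminates no better than chance, and 1.0 to perfect separation, which gives a scale for reading the 0.75 and 0.76 values reported in the abstract.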

