From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data

F1000Research ◽  
2016 ◽  
Vol 4 ◽  
pp. 1080 ◽  
Author(s):  
Aaron T. L. Lun ◽  
Gordon K. Smyth

Chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) is widely used to identify the genomic binding sites for a protein of interest. Most conventional approaches to ChIP-seq data analysis involve the detection of the absolute presence (or absence) of a binding site. However, an alternative strategy is to identify changes in the binding intensity between two biological conditions, i.e., differential binding (DB). This may yield more relevant results than conventional analyses, as changes in binding can be associated with the biological difference being investigated. The aim of this article is to facilitate the implementation of DB analyses by comprehensively describing a computational workflow for the detection of DB regions from ChIP-seq data. The workflow is based primarily on R software packages from the open-source Bioconductor project and covers all steps of the analysis pipeline, from alignment of read sequences to interpretation and visualization of putative DB regions. In particular, detection of DB regions will be conducted using the counts for sliding windows from the csaw package, with statistical modelling performed using methods in the edgeR package. Analyses will be demonstrated on real histone mark and transcription factor data sets. This will provide readers with practical usage examples that can be applied in their own studies.
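
For readers who want a concrete starting point, the counting-and-testing steps described in this abstract look roughly as follows in R. This is a minimal sketch, not the article's full workflow: the BAM file names, group labels and window width are hypothetical placeholders, and the filtering and normalization steps covered in the article are omitted.

```r
## Minimal sketch of the csaw + edgeR steps described above.
## BAM files, groups and window width are hypothetical placeholders.
library(csaw)
library(edgeR)

bam.files <- c("wt_1.bam", "wt_2.bam", "ko_1.bam", "ko_2.bam")
param <- readParam(minq = 20)          # discard low-quality alignments

# Count reads into 150 bp sliding windows across the genome.
win.counts <- windowCounts(bam.files, width = 150, param = param)

# Test each window for DB with edgeR's quasi-likelihood framework.
grouping <- factor(c("wt", "wt", "ko", "ko"))
design <- model.matrix(~grouping)
y <- asDGEList(win.counts)             # filtering/normalization omitted here
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust = TRUE)
results <- glmQLFTest(fit, coef = 2)
```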



2015 ◽  
Vol 44 (5) ◽  
pp. e45-e45 ◽  
Author(s):  
Aaron T.L. Lun ◽  
Gordon K. Smyth

Chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) is widely used to identify binding sites for a target protein in the genome. An important scientific application is to identify changes in protein binding between different treatment conditions, i.e. to detect differential binding. This can reveal potential mechanisms through which changes in binding may contribute to the treatment effect. The csaw package provides a framework for the de novo detection of differentially bound genomic regions. It uses a window-based strategy to summarize read counts across the genome. It exploits existing statistical software to test for significant differences in each window. Finally, it clusters windows into regions for output and controls the false discovery rate properly over all detected regions. The csaw package can handle arbitrarily complex experimental designs involving biological replicates. It can be applied to both transcription factor and histone mark datasets and, more generally, to any type of sequencing data measuring genomic coverage. csaw performs favorably against existing methods for de novo DB analyses on both simulated and real data. csaw is implemented as an R software package and is freely available from the open-source Bioconductor project.
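
The clustering and region-level FDR control described here can be sketched as below, continuing from the window-level objects in the sketch under the workflow abstract above. The 100 bp merging tolerance and 5% FDR threshold are illustrative choices, not prescriptions from the paper.

```r
## Merge adjacent windows into regions and combine window-level p-values
## so the false discovery rate is controlled over regions, not windows.
## `win.counts` and `results` carry over from the previous sketch.
library(csaw)

merged <- mergeWindows(rowRanges(win.counts), tol = 100)
region.stats <- combineTests(merged$ids, results$table)
sig.regions <- merged$regions[region.stats$FDR <= 0.05]
```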


Author(s):  
Martyna Daria Swiatczak

This study assesses the extent to which the two main Configurational Comparative Methods (CCMs), i.e. Qualitative Comparative Analysis (QCA) and Coincidence Analysis (CNA), produce different models. It further explains how this non-identity is due to the different algorithms upon which both methods are based, namely QCA's Quine–McCluskey algorithm and the CNA algorithm. I offer an overview of the fundamental differences between QCA and CNA and demonstrate both underlying algorithms on three data sets of ascending proximity to real-world data. Subsequent simulation studies in scenarios of varying sample sizes and degrees of noise in the data show high overall ratios of non-identity between the QCA parsimonious solution and the CNA atomic solution for varying analytical choices, i.e. different consistency and coverage threshold values and ways to derive QCA's parsimonious solution. Clarity on the contrasts between the two methods is supposed to enable scholars to make more informed decisions on their methodological approaches, enhance their understanding of what is happening behind the results generated by the software packages, and better navigate the interpretation of results. Clarity on the non-identity between the underlying algorithms and their consequences for the results is supposed to provide a basis for a methodological discussion about which method and which variants thereof are more successful in deriving which search target.
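
To make the comparison concrete, the two solution types can be produced side by side with the QCA and cna R packages. This is a hypothetical sketch on toy crisp-set data: the data, the outcome name and the 0.8 consistency/coverage thresholds are placeholders, and the calls should be checked against the current package documentation.

```r
## Hypothetical sketch: QCA parsimonious solution vs CNA atomic solution
## on the same toy crisp-set data. Data and thresholds are placeholders.
library(QCA)   # Quine-McCluskey minimization
library(cna)   # Coincidence Analysis

dat <- data.frame(A = c(1, 1, 0, 0, 1), B = c(1, 0, 1, 0, 0),
                  C = c(0, 1, 1, 0, 1), O = c(1, 1, 1, 0, 1))

# QCA: build the truth table, then minimize with remainders included
# ("?") to obtain the parsimonious solution.
tt <- truthTable(dat, outcome = "O", incl.cut = 0.8)
psol <- minimize(tt, include = "?")

# CNA: atomic solution formulas at matching thresholds.
sol <- cna(dat, outcome = "O", con = 0.8, cov = 0.8)
asf(sol)
```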


2001 ◽  
Vol 33 (4) ◽  
pp. 529-549 ◽  
Author(s):  
Y. LE STRAT ◽  
J. C. THALABARD

A large multicentre epidemiological study was carried out by WHO between 1991 and 1995 to analyse the duration of lactational amenorrhoea in relation to breast-feeding. The main results of this analysis, which used classical statistical modelling, have already been published. However, some specific aspects of the postpartum fertility covariates, amenorrhoea and breast-feeding, and more specifically the observed progressive exhaustion of the breast-feeding inhibitory effect on the reproductive axis, may justify a closer look at the validity of the statistical tools. Indeed, as has already been emphasized, analysis of large longitudinal data sets in reproduction often faces three difficulties: (i) the precise determination of the event of interest, (ii) the way to handle the time evolution of both the studied variables and their effect on the event of interest, and (iii) the often discrete nature of the data and the associated problem of tied events. The first objective of the present work was to give additional insights into the estimation and quantification of the dynamics of the effect of breast-feeding over time, considering this covariate either as fixed or time-dependent. The second objective was to show how to perform the analyses using corresponding adapted procedures in widely available statistical packages, without the need to acquire particular programming skills.
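
The kind of analysis described maps onto standard survival routines. Below is a minimal hedged sketch using R's survival package; the toy data layout and variable names are hypothetical. Breast-feeding enters as a time-dependent covariate via the counting-process (tstart, tstop) form, and the ties= argument selects how tied event times are handled.

```r
## Hypothetical sketch: time to return of menses, with breast-feeding as
## a time-dependent covariate in counting-process form.
library(survival)

# Toy data: one row per woman per interval (all values invented).
amen <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  tstart = c(0, 6, 0, 4, 0),
  tstop  = c(6, 14, 4, 9, 12),
  menses = c(0, 1, 0, 1, 1),     # event: return of menses
  bf     = c(1, 0, 1, 0, 1)      # breast-feeding status in the interval
)

# A fixed covariate would use Surv(time, event) on one row per woman;
# "efron" (or "exact") addresses heavily tied, discrete event times.
fit <- coxph(Surv(tstart, tstop, menses) ~ bf, data = amen, ties = "efron")
summary(fit)
```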


2019 ◽  
Vol 15 ◽  
pp. 117693431984907 ◽  
Author(s):  
Tomáš Farkaš ◽  
Jozef Sitarčík ◽  
Broňa Brejová ◽  
Mária Lucká

Computing similarity between 2 nucleotide sequences is one of the fundamental problems in bioinformatics. Current methods are based mainly on 2 major approaches: (1) sequence alignment, which is computationally expensive, and (2) faster, but less accurate, alignment-free methods based on various statistical summaries, for example, short word counts. We propose a new distance measure based on mathematical transforms from the domain of signal processing. To tolerate large-scale rearrangements in the sequences, the transform is computed across sliding windows. We compare our method on several data sets with current state-of-the-art alignment-free methods. Our method compares favorably in terms of accuracy and outperforms other methods in running time and memory requirements. In addition, it is massively scalable up to dozens of processing units without loss of performance due to communication overhead. Source files and sample data are available at https://bitbucket.org/fiitstubioinfo/swspm/src
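
To make the general idea concrete (this is a generic illustration, not the authors' exact transform or distance): encode the nucleotides as a numeric signal, transform sliding windows (here with the FFT), and compare averaged magnitude spectra.

```r
## Generic illustration of transform-based, alignment-free comparison.
dna_signal <- function(seq) {
  map <- c(A = 1, C = 2, G = 3, T = 4)   # simple numeric encoding
  unname(map[strsplit(seq, "")[[1]]])
}

window_spectrum <- function(x, win = 64, step = 32) {
  starts <- seq(1, length(x) - win + 1, by = step)
  # Magnitude spectrum of each sliding window, averaged across windows.
  specs <- sapply(starts, function(s) Mod(fft(x[s:(s + win - 1)]))[1:(win / 2)])
  rowMeans(specs)
}

spectral_distance <- function(seq1, seq2) {
  s1 <- window_spectrum(dna_signal(seq1))
  s2 <- window_spectrum(dna_signal(seq2))
  sqrt(sum((s1 - s2)^2))   # Euclidean distance between averaged spectra
}

# Usage with two random toy sequences:
set.seed(1)
s1 <- paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE), collapse = "")
s2 <- paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE), collapse = "")
spectral_distance(s1, s2)
```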


2012 ◽  
Vol 46 (1) ◽  
pp. 108-119 ◽  
Author(s):  
Simon W. M. Tanley ◽  
Antoine M. M. Schreurs ◽  
John R. Helliwell ◽  
Loes M. J. Kroon-Batenburg

The International Union of Crystallography has for many years been advocating archiving of raw data to accompany structural papers. Recently, it initiated the formation of the Diffraction Data Deposition Working Group with the aim of developing standards for the representation of these data. A means of studying this issue is to submit exemplar publications with associated raw data and metadata. A recent study on the effects of dimethyl sulfoxide on the binding of cisplatin and carboplatin to histidine in 11 different lysozyme crystals from two diffractometers led to an investigation of the possible effects of the equipment and X-ray diffraction data processing software on the calculated occupancies and B factors of the bound Pt compounds. 35.3 Gb of data were transferred from Manchester to Utrecht to be processed with EVAL. A systematic comparison shows that the largest differences in the occupancies and B factors of the bound Pt compounds are due to the software, but the equipment also has a noticeable effect. A detailed description of and discussion on the availability of metadata is given. By making these raw diffraction data sets available via a local depository, it is possible for the diffraction community to make their own evaluation as they may wish.


2016 ◽  
Vol 14 (06) ◽  
pp. 1650034 ◽  
Author(s):  
Naim Al Mahi ◽  
Munni Begum

One of the primary objectives of a ribonucleic acid (RNA) sequencing or RNA-Seq experiment is to identify differentially expressed (DE) genes in two or more treatment conditions. It is a common practice to assume that all read counts from RNA-Seq data follow an overdispersed (OD) Poisson or negative binomial (NB) distribution, which is sometimes misleading because within each condition, some genes may have unvarying transcription levels with no overdispersion. In such a case, it is more appropriate and logical to consider two sets of genes: OD and non-overdispersed (NOD). We propose a new two-step integrated approach to distinguish DE genes in RNA-Seq data using standard Poisson and NB models for NOD and OD genes, respectively. This is an integrated approach because this method can be merged with any other NB-based methods for detecting DE genes. We design a simulation study and analyze two real RNA-Seq data sets to evaluate the proposed strategy. We compare the performance of this new method combined with the three R software packages, namely edgeR, DESeq2, and DSS, with their default settings. For both the simulated and real data sets, integrated approaches perform better or at least equally well compared to the regular methods embedded in these R packages.
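
A hedged sketch of the two-step idea, not the authors' exact implementation: test each gene for overdispersion with a likelihood-ratio test of Poisson against NB, then take the treatment-effect p-value from the model the test selects. The 0.05 cutoff and the boundary-corrected (halved) p-value are standard but illustrative choices.

```r
## Sketch of the two-step idea (not the authors' exact implementation).
## counts: read counts for one gene across samples; group: 2-level factor.
library(MASS)

test_gene <- function(counts, group) {
  pois <- glm(counts ~ group, family = poisson)
  nb   <- glm.nb(counts ~ group)
  # LR test of Poisson (no overdispersion) nested in NB; p-value halved
  # because the dispersion parameter lies on the boundary under the null.
  lrt <- 2 * as.numeric(logLik(nb) - logLik(pois))
  od  <- pchisq(lrt, df = 1, lower.tail = FALSE) / 2 < 0.05
  fit <- if (od) nb else pois   # NB for OD genes, Poisson for NOD genes
  coef(summary(fit))[2, 4]      # p-value for the treatment effect
}

# Usage (hypothetical counts for one gene across 3 + 3 samples):
# test_gene(c(5, 7, 6, 15, 18, 20), factor(rep(c("A", "B"), each = 3)))
```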


2014 ◽  
Vol 7 (7) ◽  
pp. 2273-2281 ◽  
Author(s):  
G. Fratini ◽  
M. Mauder

A comparison of two popular eddy-covariance software packages is presented, namely EddyPro and TK3. Two approximately 1-month-long test data sets were processed, representing typical instrumental setups (i.e., CSAT3/LI-7500 above grassland and Solent R3/LI-6262 above a forest). The resulting fluxes and quality flags were compared. Achieving satisfying agreement and understanding residual discrepancies required several iterations and interventions of different natures, spanning from simple software reconfiguration to actual code manipulations. In this paper, we document our comparison exercise and show that the two software packages can provide fully satisfactory agreement when properly configured. Our main aim, however, is to stress the complexity of performing a rigorous comparison of eddy-covariance software. We show that discriminating actual discrepancies in the results from inconsistencies in the software configuration requires deep knowledge of both software packages and of the eddy-covariance method. In some instances, it may even be beyond the reach of an investigator who does not have access to, and full knowledge of, the source code. As the developers of EddyPro and TK3, we could discuss the comparison at all levels of detail, and this proved necessary to achieve a full understanding. As a result, we suggest that researchers are more likely to get comparable results when using EddyPro (v5.1.1) and TK3 (v3.11) – at least with the settings presented in this paper – than when using any other pair of EC software packages that have not undergone a similar cross-validation. As a further consequence, we also suggest that, to ensure the consistency and comparability of centralized flux databases, and for confident use of eddy fluxes in synthesis studies on regional, continental and global scales, researchers rely only on software that has been extensively validated in documented intercomparisons.
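
For orientation, a toy sketch (not what either package does in full): the quantity at the heart of both packages is the block-averaged covariance between vertical wind speed and a scalar, and the many processing choices around it (coordinate rotation, detrending, spectral corrections) are where implementations can diverge.

```r
## Toy sketch: the raw eddy flux over one averaging block is the
## covariance of vertical wind w and a scalar concentration; the
## corrections around this step are where software packages differ.
block_flux <- function(w, conc) {
  mean((w - mean(w)) * (conc - mean(conc)))   # w'c' covariance
}

# Usage with toy high-frequency samples (30 min at 10 Hz):
w    <- rnorm(18000)
conc <- 400 + 0.1 * w + rnorm(18000)
block_flux(w, conc)   # close to 0.1 by construction
```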


2017 ◽  
Author(s):  
Darrell O. Ricke ◽  
Steven Schwartz

High-throughput sequencing (HTS) of DNA forensic samples is expanding from the sizing of short tandem repeats (STRs) to massively parallel sequencing (MPS). HTS panels are expanding from the FBI's 20 core Combined DNA Index System (CODIS) loci to include SNPs. The calculation of random man not excluded, P(RMNE), is used in DNA mixture analysis to estimate the probability that a person is present in a DNA mixture. This calculation encounters numerical artifacts as panel sizes expand. Increasing the floating-point precision of the calculations allows for larger panel sizes, but with a corresponding increase in computation time. The Taylor-series high-precision libraries used fail on some input data sets, making the algorithm unreliable. Herein, a new formula is introduced for calculating P(RMNE) that scales to larger SNP panel sizes while being computationally efficient (patent pending)[1].
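
To illustrate the numerical issue (this is the conventional product form, not the authors' new formula): P(RMNE) is a product of per-locus inclusion probabilities, which underflows double precision for large SNP panels, while accumulating in log-space stays representable without extended-precision libraries.

```r
## Sketch of the underflow problem, not the patented formula.
## p_incl[i]: probability a random person is not excluded at SNP locus i.
log10_p_rmne <- function(p_incl) sum(log10(p_incl))   # log of the product

# Example: 20,000 SNP loci with inclusion probability 0.9 each.
p <- rep(0.9, 20000)
prod(p)            # 0: the naive product underflows double precision
log10_p_rmne(p)    # about -915, i.e. P(RMNE) = 10^-915
```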

