Accurate loop calling for 3D genomic data with cLoops

Author(s):  
Yaqiang Cao ◽  
Zhaoxiong Chen ◽  
Xingwei Chen ◽  
Daosheng Ai ◽  
Guoyu Chen ◽  
...  

Abstract Motivation Sequencing-based 3D genome mapping technologies can identify loops formed by interactions between regulatory elements hundreds of kilobases apart. Existing loop-calling tools are mostly restricted to a single data type, with accuracy dependent on a predefined resolution contact matrix or called peaks, and can have prohibitive hardware costs. Results Here, we introduce cLoops (‘see loops’) to address these limitations. cLoops is based on the clustering algorithm cDBSCAN that directly analyzes the paired-end tags (PETs) to find candidate loops and uses a permuted local background to estimate statistical significance. These two data-type-independent processes enable loops to be reliably identified for both sharp and broad peak data, including but not limited to ChIA-PET, Hi-C, HiChIP and Trac-looping data. Loops identified by cLoops showed much less distance-dependent bias and higher enrichment relative to local regions than existing tools. Altogether, cLoops improves the accuracy of detecting 3D genomic loops from sequencing data, is versatile, flexible and efficient, and has modest hardware requirements. Availability and implementation cLoops with documentation and example data are freely available at: https://github.com/YaqiangCao/cLoops. Supplementary information Supplementary data are available at Bioinformatics online.
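
The clustering step described above can be illustrated with a minimal sketch. It treats each PET as a 2D point (left-end coordinate, right-end coordinate) and groups nearby PETs with a plain textbook DBSCAN. This is not the cDBSCAN implementation shipped with cLoops: the function names, the Euclidean distance, the brute-force neighbor search, the `eps`/`min_pts` values and the toy coordinates are all assumptions made for illustration, and the permuted-local-background significance test is not covered.

```python
from collections import deque
from math import hypot


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (brute force)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points) if hypot(x - xi, y - yi) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)  # None = not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


# Hypothetical PETs on one chromosome: (left-end position, right-end position).
pets = [
    (1000, 50000), (1010, 50020), (990, 49980), (1005, 50010),          # candidate loop
    (200000, 300000), (200020, 300010), (199990, 299990), (200010, 300030),  # candidate loop
    (5000, 900000),                                                      # isolated PET
]
labels = dbscan(pets, eps=100, min_pts=3)
```

Each non-negative label marks a group of PETs whose two ends pile up in the same pair of regions, which is the kind of candidate the abstract says is then tested against a permuted local background.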

2018 ◽  
Author(s):  
Yaqiang Cao ◽  
Xingwei Chen ◽  
Daosheng Ai ◽  
Zhaoxiong Chen ◽  
Guoyu Chen ◽  
...  

Abstract Sequencing-based 3D genome mapping technologies can identify loops formed by interactions between regulatory elements hundreds of kilobases apart. Existing loop-calling tools are mostly restricted to a single data type, with accuracy dependent on a pre-defined resolution contact matrix or called peaks, and can have prohibitive hardware costs. Here we introduce cLoops (‘see loops’) to address these limitations. cLoops is based on the clustering algorithm cDBSCAN that directly analyzes the paired-end tags (PETs) to find candidate loops and uses a permuted local background to estimate statistical significance. These two data-type-independent processes enable loops to be reliably identified for both sharp and broad peak data, including but not limited to ChIA-PET, Hi-C, HiChIP and Trac-looping data. Loops identified by cLoops showed much less distance-dependent bias and higher enrichment relative to local regions than existing tools. Altogether, cLoops improves the accuracy of detecting 3D genomic loops from sequencing data, is versatile, flexible and efficient, has modest hardware requirements, and is freely available at: https://github.com/YaqiangCao/cLoops.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Minji Kim ◽  
Meizhen Zheng ◽  
Simon Zhongyuan Tian ◽  
Byoungkoo Lee ◽  
Jeffrey H. Chuang ◽  
...  

Abstract The single-molecule multiplex chromatin interaction data are generated by emerging 3D genome mapping technologies such as GAM, SPRITE, and ChIA-Drop. These datasets provide insights into high-dimensional chromatin organization, yet introduce new computational challenges. Thus, we developed MIA-Sig, an algorithmic solution based on signal processing and information theory. We demonstrate its ability to de-noise the multiplex data, assess the statistical significance of chromatin complexes, and identify topological domains and frequent inter-domain contacts. On chromatin immunoprecipitation (ChIP)-enriched data, MIA-Sig can clearly distinguish the protein-associated interactions from the non-specific topological domains. Together, MIA-Sig represents a novel algorithmic framework for multiplex chromatin interaction analysis.


2019 ◽  
Author(s):  
Minji Kim ◽  
Meizhen Zheng ◽  
Simon Zhongyuan Tian ◽  
Daniel Capurso ◽  
Byoungkoo Lee ◽  
...  

Abstract The single-molecule multiplex chromatin interaction data generated by emerging non-ligation-based 3D genome mapping technologies provide novel insights into high-dimensional chromatin organization, yet introduce new computational challenges. We developed MIA-Sig (https://github.com/TheJacksonLaboratory/mia-sig.git), an algorithmic framework to de-noise the data, assess the statistical significance of chromatin complexes, and identify topological domains and inter-domain contacts. On chromatin immunoprecipitation (ChIP)-enriched data, MIA-Sig can clearly distinguish the protein-associated interactions from the non-specific topological domains.


2019 ◽  
Vol 35 (20) ◽  
pp. 3931-3936 ◽  
Author(s):  
Xin Huang ◽  
Xudong Gao ◽  
Wanying Li ◽  
Shuai Jiang ◽  
Ruijiang Li ◽  
...  

Abstract Motivation During development of the mammalian embryo, histone modification H3K4me3 plays an important role in regulating gene expression and exhibits extensive reprograming on the parental genomes. In addition to these dramatic epigenetic changes, certain unchanging regulatory elements are also essential for embryonic development. Results Using large-scale H3K4me3 chromatin immunoprecipitation sequencing data, we identified a form of H3K4me3 that was present during all eight stages of the mouse embryo before implantation. This ‘stable H3K4me3’ was highly accessible and much longer than normal H3K4me3. Moreover, most of the stable H3K4me3 was in the promoter region and was enriched in higher chromatin architecture. Using in-depth analysis, we demonstrated that stable H3K4me3 was related to higher gene expression levels and transcriptional initiation during embryonic development. Furthermore, stable H3K4me3 was much more active in blood tumor cells than in normal blood cells, suggesting a potential mechanism of cancer progression. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (22) ◽  
pp. 4788-4790 ◽  
Author(s):  
Claudia Arnedo-Pac ◽  
Loris Mularoni ◽  
Ferran Muiños ◽  
Abel Gonzalez-Perez ◽  
Nuria Lopez-Bigas

Abstract Motivation Identification of the genomic alterations driving tumorigenesis is one of the main goals in oncogenomics research. Given the evolutionary principles of cancer development, computational methods that detect signals of positive selection in the pattern of tumor mutations have been effectively applied in the search for cancer genes. One of these signals is the abnormal clustering of mutations, which has been shown to be complementary to other signals in the detection of driver genes. Results We have developed OncodriveCLUSTL, a new sequence-based clustering algorithm to detect significant clustering signals across genomic regions. OncodriveCLUSTL is based on a local background model derived from the simulation of mutations accounting for the composition of tri- or penta-nucleotide context substitutions observed in the cohort under study. Our method can identify known clusters and bona-fide cancer drivers across cohorts of tumor whole-exomes, outperforming the existing OncodriveCLUST algorithm and complementing other methods based on different signals of positive selection. Our results indicate that OncodriveCLUSTL can be applied to the analysis of non-coding genomic elements and non-human mutations data. Availability and implementation OncodriveCLUSTL is available as an installable Python 3.5 package. The source code and running examples are freely available at https://bitbucket.org/bbglab/oncodriveclustl under GNU Affero General Public License. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (14) ◽  
pp. i99-i107 ◽  
Author(s):  
Qiao Liu ◽  
Hairong Lv ◽  
Rui Jiang

Abstract Motivation Hi-C is a genome-wide technology for investigating 3D chromatin conformation by measuring physical contacts between pairs of genomic regions. The resolution of Hi-C data directly impacts the effectiveness and accuracy of downstream analysis such as identifying topologically associating domains (TADs) and meaningful chromatin loops. High-resolution Hi-C data are valuable resources for studying the relationship between 3D genome conformation and function, especially for linking distal regulatory elements to their target genes. However, high-resolution Hi-C data across various tissues and cell types are not always available due to the high sequencing cost. It is therefore indispensable to develop computational approaches for enhancing the resolution of Hi-C data. Results We proposed hicGAN, an open-source framework, for inferring high-resolution Hi-C data from low-resolution Hi-C data with generative adversarial networks (GANs). To the best of our knowledge, this is the first study to apply GANs to 3D genome analysis. We demonstrate that hicGAN effectively enhances the resolution of low-resolution Hi-C data by generating matrices that are highly consistent with the original high-resolution Hi-C matrices. A typical usage scenario for our approach is to enhance low-resolution Hi-C data in new cell types, especially where high-resolution Hi-C data are not available. Our study not only presents a novel approach for enhancing Hi-C data resolution, but also provides insights into the complex mechanisms underlying the formation of chromatin contacts. Availability and implementation We release hicGAN as open-source software at https://github.com/kimmo1019/hicGAN. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Karolina Stępniak ◽  
Magdalena A. Machnicka ◽  
Jakub Mieczkowski ◽  
Anna Macioszek ◽  
Bartosz Wojtaś ◽  
...  

Abstract Chromatin structure and accessibility, and combinatorial binding of transcription factors to regulatory elements in genomic DNA control transcription. Genetic variations in genes encoding histones, epigenetics-related enzymes or modifiers affect chromatin structure/dynamics and result in alterations in gene expression contributing to cancer development or progression. Gliomas are brain tumors frequently associated with epigenetics-related gene deregulation. We perform whole-genome mapping of chromatin accessibility, histone modifications, DNA methylation patterns and transcriptome analysis simultaneously in multiple tumor samples to unravel epigenetic dysfunctions driving gliomagenesis. Based on the results of the integrative analysis of the acquired profiles, we create an atlas of active enhancers and promoters in benign and malignant gliomas. We explore these elements and intersect with Hi-C data to uncover molecular mechanisms instructing gene expression in gliomas.


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Abstract Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation The source code and the pre-trained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Givanna H Putri ◽  
Irena Koprinska ◽  
Thomas M Ashhurst ◽  
Nicholas J C King ◽  
Mark N Read

Abstract Motivation Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results onto other non-benchmark datasets. Results We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain. Availability and implementation Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench. Supplementary information Supplementary data are available at Bioinformatics online.
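
The idea of judging clustering runs by their trade-off across several metrics can be sketched as a non-dominated (Pareto) filter. The snippet below is an illustration only, not code from the ParetoBench repository: the function names, the run labels and the metric scores are hypothetical, and it assumes every metric is oriented so that higher is better.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every metric
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(solutions):
    """Return the names of solutions that no other solution dominates."""
    return {
        name
        for name, scores in solutions.items()
        if not any(dominates(other, scores) for other in solutions.values())
    }


# Hypothetical (metric_1, metric_2) scores for four clustering runs.
runs = {
    "run_a": (0.90, 0.60),
    "run_b": (0.80, 0.80),
    "run_c": (0.70, 0.70),  # worse than run_b on both metrics
    "run_d": (0.60, 0.95),
}
front = pareto_front(runs)
```

Here `run_c` drops out because `run_b` beats it on both metrics, while the other three runs each represent a different trade-off and stay on the front.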


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Xinjun Li ◽  
Kristina Sundquist ◽  
Jan Sundquist ◽  
Asta Försti ◽  
Kari Hemminki

Abstract Childhood acute lymphoblastic leukemia (ALL) has an origin in the fetal period, which may distinguish it from ALL diagnosed later in life. We wanted to test whether familial risks differ between ALL diagnosed in very early childhood and ALL diagnosed later. The Swedish nation-wide family-cancer data were used until year 2016 to calculate standardized incidence ratios (SIRs) for familial risks in ALL in three diagnostic age-groups: 0–4, 5–34 and 35+ years. Among 1335 ALL patients diagnosed before age 5, familial risks were increased for esophageal (4.78), breast (1.42), prostate (1.40) and connective tissue (2.97) cancers and leukemia (2.51, ALL 7.81). In age-group 5–34 years, rectal (1.73) and endometrial (2.40) cancer, myeloma (2.25) and leukemia (2.00, ALL 4.60) reached statistical significance. In the oldest age-group, the only association was with Hodgkin lymphoma (3.42). Diagnostic ages of family members of ALL patients were significantly lower compared to these cancers in the population for breast, prostate and rectal cancers. The patterns of increased familial cancers suggest that BRCA2 mutations could contribute to associations of ALL with breast and prostate cancers, and mismatch repair gene PMS2 mutations with rectal and endometrial cancers. Future DNA sequencing data will be a test for these familial predictions.
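
For readers unfamiliar with the measure, a standardized incidence ratio is conventionally the number of observed cases divided by the number expected if reference rates applied to the study group. The sketch below shows that generic definition only. It is not the authors' analysis pipeline, and the strata, person-years, rates and case count are made-up numbers chosen to keep the arithmetic easy to follow.

```python
def expected_cases(person_years, reference_rates):
    """Expected cases: sum over strata of person-years times reference rate."""
    return sum(py * rate for py, rate in zip(person_years, reference_rates))


def standardized_incidence_ratio(observed, person_years, reference_rates):
    """SIR = observed cases / expected cases."""
    return observed / expected_cases(person_years, reference_rates)


# Hypothetical two-stratum example (values are illustrative, not from the study).
person_years = [10_000, 20_000]      # follow-up time per stratum
reference_rates = [0.0001, 0.0002]   # cases per person-year in the reference group
observed = 10                        # cases actually seen in the study group

sir = standardized_incidence_ratio(observed, person_years, reference_rates)
```

With these numbers the expected count is 1 + 4 = 5, so ten observed cases give an SIR of about 2, meaning twice the incidence expected from the reference rates.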

