Large-scale mammalian genome rearrangements coincide with chromatin interactions

Abstract Motivation Genome rearrangements drastically change gene order along great stretches of a chromosome. There has been initial evidence that these apparently non-local events in the 1D sense may have breakpoints that are close in the 3D sense. We harness the power of the Double Cut and Join model of genome rearrangement, along with Hi-C chromosome conformation capture data to test this hypothesis between human and mouse. Results We devise novel statistical tests that show that indeed, rearrangement scenarios that transform the human into the mouse gene order are enriched for pairs of breakpoints that have frequent chromosome interactions. This is observed for both intra-chromosomal breakpoint pairs, as well as for inter-chromosomal pairs. For intra-chromosomal rearrangements, the enrichment exists from close (<20 Mb) to very distant (100 Mb) pairs. Further, the pattern exists across multiple cell lines in Hi-C data produced by different laboratories and at different stages of the cell cycle. We show that similarities in the contact frequencies between these many experiments contribute to the enrichment. We conclude that either (i) rearrangements usually involve breakpoints that are spatially close or (ii) there is selection against rearrangements that act on spatially distant breakpoints. Availability and implementation Our pipeline is freely available at https://bitbucket.org/thekswenson/locality. Supplementary information Supplementary data are available at Bioinformatics online.


Super short operations on both gene order and intergenic sizes

Algorithms for Molecular Biology ◽

10.1186/s13015-019-0156-5 ◽

2019 ◽

Vol 14 (1) ◽

Cited By ~ 1

Author(s):

Andre R. Oliveira ◽

Géraldine Jean ◽

Guillaume Fertin ◽

Ulisses Dias ◽

Zanoni Dias

Keyword(s):

Approximation Algorithms ◽

Gene Order ◽

Genome Rearrangement ◽

Unit Cost ◽

Genome Rearrangements ◽

Minimum Length ◽

Approximation Factor ◽

A Genome ◽

Number Of Genes ◽

Intergenic Regions

Abstract Background The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called genome rearrangements, that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models. Examples include allowing different genome rearrangement operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), and considering that a rearrangement implies a cost depending on its length rather than a unit cost. Most of the works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called intergenic regions. Results and conclusions In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called super short reversals (SSRs) and super short transpositions (SSTs), which affect up to two (consecutive) genes. We denote by super short operations (SSOs) any SSR or SST. We present 3-approximation algorithms for SSRs, SSTs, or SSOs when the orientation of the genes is not considered, and 5-approximation algorithms for either SSRs or SSOs when the orientation is considered. We also show that the approximation factors of these algorithms improve when the input permutation has a higher number of inversions: the factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.
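The role of inversions in the abstract above can be illustrated with a small sketch. It makes simplifying assumptions that the paper does not: genes are unsigned, intergenic sizes are ignored, and only reversals of two consecutive genes (adjacent swaps) are used. Under those assumptions the minimum number of operations equals the inversion count of the permutation, which is a standard fact about adjacent swaps. The function names `count_inversions` and `sort_by_adjacent_swaps` are hypothetical, and this is not the authors' approximation algorithm.

```python
def count_inversions(perm):
    """Count pairs (i, j) with i < j and perm[i] > perm[j].

    Quadratic time, written for clarity rather than speed.
    """
    n = len(perm)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if perm[i] > perm[j]
    )


def sort_by_adjacent_swaps(perm):
    """Sort a gene order using only swaps of two consecutive genes.

    Assumption: unsigned genes, no intergenic sizes. Each swap removes
    exactly one inversion, so the number of swaps equals the inversion
    count. Returns the sorted order and the list of swap positions.
    """
    genes = list(perm)
    swaps = []
    for end in range(len(genes) - 1, 0, -1):
        for i in range(end):
            if genes[i] > genes[i + 1]:
                genes[i], genes[i + 1] = genes[i + 1], genes[i]
                swaps.append(i)
    return genes, swaps


if __name__ == "__main__":
    order = [3, 1, 2, 5, 4]
    sorted_order, swaps = sort_by_adjacent_swaps(order)
    print(count_inversions(order), len(swaps), sorted_order)
```

Once intergenic sizes must also be matched, the inversion count is no longer the whole answer, which is the setting the abstract addresses with approximation algorithms.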


capC-MAP: software for analysis of Capture-C data

Bioinformatics ◽

10.1093/bioinformatics/btz480 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4773-4775 ◽

Cited By ~ 1

Author(s):

Adam Buckle ◽

Nick Gilbert ◽

Davide Marenduzzo ◽

Chris A Brackley

Keyword(s):

Software Package ◽

Experimental Methods ◽

Ease Of Use ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Chromosome Conformation ◽

Chromatin Interactions ◽

Genome Wide ◽

Genomic Locations

Abstract Summary Capture-C is a member of the chromosome-conformation-capture family of experimental methods which probes the 3D organization of chromosomes within the cell nucleus. It provides high-resolution information on the genome-wide chromatin interactions from a set of ‘target’ genomic locations, and is growing in popularity as a tool for improving our understanding of cis-regulation and gene function. Yet, analysis of the data is complicated, and to date there has been no dedicated or easy-to-use software to automate the process. We present capC-MAP, a software package for the analysis of Capture-C data. Availability and implementation Implemented with both ease of use and flexibility in mind, capC-MAP is a suite of programs written in C++ and Python, where each program can be run separately, or an entire analysis can be performed with a single command line. It is available under an open-source licence at https://github.com/cbrackley/capC-MAP, as well as via the conda package manager, and should run on any standard Unix-style system. Supplementary information Supplementary data are available at Bioinformatics online.


Deciphering hierarchical organization of topologically associated domains through change-point testing

BMC Bioinformatics ◽

10.1186/s12859-021-04113-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Haipeng Xing ◽

Yingru Wu ◽

Michael Q. Zhang ◽

Yong Chen

Keyword(s):

Large Scale ◽

Negative Binomial ◽

Mixture Distribution ◽

Hierarchical Organization ◽

Chromatin Interaction ◽

Interaction Matrix ◽

Good Precision ◽

Chromosome Conformation ◽

Chromatin Interactions ◽

Topologically Associated Domains

Abstract Background The nucleus of eukaryotic cells spatially packages chromosomes into a hierarchical and distinct segregation that plays critical roles in maintaining transcription regulation. High-throughput methods of chromosome conformation capture, such as Hi-C, have revealed topologically associating domains (TADs) that are defined by biased chromatin interactions within them. Results We introduce a novel method, HiCKey, to decipher hierarchical TAD structures in Hi-C data and compare them across samples. We first derive a generalized likelihood-ratio (GLR) test for detecting change-points in an interaction matrix that follows a negative binomial distribution or general mixture distribution. We then employ several optimal search strategies to decipher hierarchical TADs with p values calculated by the GLR test. Large-scale validations of simulation data show that HiCKey has good precision in recalling known TADs and is robust against random collisions of chromatin interactions. By applying HiCKey to Hi-C data of seven human cell lines, we identified multiple layers of TAD organization among them, but the vast majority had no more than four layers. In particular, we found that TAD boundaries are significantly enriched in active chromosomal regions compared to repressed regions. Conclusions HiCKey is optimized for processing large matrices constructed from high-resolution Hi-C experiments. The method and theoretical result of the GLR test provide a general framework for significance testing of similar experimental chromatin interaction data that may not fully follow negative binomial distributions but rather more general mixture distributions.


decorate: differential epigenetic correlation test

Bioinformatics ◽

10.1093/bioinformatics/btaa067 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2856-2861

Author(s):

Gabriel E Hoffman ◽

Jaroslav Bendl ◽

Kiran Girdhar ◽

Panos Roussos

Keyword(s):

Large Scale ◽

Statistical Tests ◽

Computational Cost ◽

R Package ◽

Supplementary Information ◽

Expression Data ◽

Correlation Test ◽

Disease Biology ◽

Genome Wide ◽

Insight Into

Abstract Motivation Identifying correlated epigenetic features and finding differences in correlation between individuals with disease compared to controls can give novel insight into disease biology. This framework has been successful in analysis of gene expression data, but application to epigenetic data has been limited by the computational cost, lack of scalable software and lack of robust statistical tests. Results Decorate, differential epigenetic correlation test, identifies correlated epigenetic features and finds clusters of features that are differentially correlated between two or more subsets of the data. The software scales to genome-wide datasets of epigenetic assays on hundreds of individuals. We apply decorate to four large-scale datasets of DNA methylation, ATAC-seq and histone modification ChIP-seq. Availability and implementation decorate R package is available from https://github.com/GabrielHoffman/decorate. Supplementary information Supplementary data are available at Bioinformatics online.


Unique mitochondrial gene order in Xenodermichthys copei (Alepocephalidae: Otocephala) – a first observation of a large-scale rearranged 16S–WANCY region in vertebrates

Mitochondrial DNA Part B ◽

10.1080/23802359.2018.1551082 ◽

2019 ◽

Vol 4 (1) ◽

pp. 511-514

Author(s):

Jan Yde Poulsen ◽

Tetsuya Sado ◽

Masaki Miya

Keyword(s):

Gene Order ◽

Large Scale ◽

Mitochondrial Gene ◽

Mitochondrial Gene Order


Parallelized calculation of permutation tests

Bioinformatics ◽

10.1093/bioinformatics/btaa1007 ◽

2020 ◽

Author(s):

Markus Ekvall ◽

Michael Höhle ◽

Lukas Käll

Keyword(s):

Dynamic Programming ◽

Sample Size ◽

Permutation Test ◽

Statistical Tests ◽

Permutation Tests ◽

Supplementary Information ◽

Attractive Alternative ◽

Test Statistic ◽

Sample Distribution ◽

Running Time

Abstract Motivation Permutation tests offer a straightforward framework to assess the significance of differences in sample statistics. A significant advantage of permutation tests is that relatively few assumptions about the distribution of the test statistic are needed, as they rely on the assumption of exchangeability of the group labels. They have great value, as they allow a sensitivity analysis to determine the extent to which the assumed broad sample distribution of the test statistic applies. However, in this situation, permutation tests are rarely applied because the running time of naïve implementations is too long and grows exponentially with the sample size. Nevertheless, continued development in the 1980s introduced dynamic programming algorithms that compute exact permutation tests in polynomial time. Despite this significant running time reduction, the exact test has not yet become one of the predominant statistical tests for medium sample size. Here, we propose a computational parallelization of one such dynamic programming-based permutation test, the Green algorithm, which makes the permutation test more attractive. Results Parallelization of the Green algorithm was found possible by non-trivial rearrangement of the structure of the algorithm. A speed-up of orders of magnitude is achievable by executing the parallelized algorithm on a GPU. We demonstrate that the execution time essentially becomes a non-issue, even for sample sizes as high as hundreds of samples. This improvement makes our method an attractive alternative to, e.g. the widely used asymptotic Mann-Whitney U-test. Availability and implementation Python 3 code is available from the GitHub repository https://github.com/statisticalbiotechnology/parallelPermutationTest under an Apache 2.0 license. Supplementary information Supplementary data are available at Bioinformatics online.
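To make concrete why naïve implementations scale badly, the following is a minimal sketch of an exact two-sample permutation test by brute-force enumeration. It is not the Green algorithm or the authors' GPU implementation. The function name `exact_permutation_pvalue`, the choice of the absolute difference in means as the test statistic, and the small tolerance used for floating-point ties are all assumptions made for illustration.

```python
from itertools import combinations


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Enumerates every way of assigning len(group_a) of the pooled values
    to group A, so the work grows as C(n_a + n_b, n_a): fine for tiny
    samples, infeasible for large ones.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(sum_a):
        # Difference in group means, given only the sum assigned to A.
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    at_least_as_extreme = 0
    n_labelings = 0
    for indices in combinations(range(len(pooled)), n_a):
        n_labelings += 1
        relabeled_sum = sum(pooled[i] for i in indices)
        # Tolerance guards against floating-point ties (an assumption).
        if abs(mean_diff(relabeled_sum)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_labelings


if __name__ == "__main__":
    print(exact_permutation_pvalue([1, 2, 3], [4, 5, 6]))
```

With two groups of three values there are only 20 labelings to visit, but with two groups of 50 the count exceeds 10^28, which is the growth that dynamic programming approaches avoid.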


TreeMerge: a new method for improving the scalability of species tree estimation methods

Bioinformatics ◽

10.1093/bioinformatics/btz344 ◽

2019 ◽

Vol 35 (14) ◽

pp. i417-i426 ◽

Cited By ~ 7

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Large Scale ◽

Species Tree ◽

New Method ◽

Divide And Conquer ◽

Supplementary Information ◽

Estimation Methods ◽

Running Time ◽

Tree Estimation ◽

Computationally Intensive ◽

A Minor

Abstract Motivation At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species. Results Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework, only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation TreeMerge is publicly available on GitHub (http://github.com/ekmolloy/treemerge). Supplementary information Supplementary data are available at Bioinformatics online.


EARRINGS: an efficient and accurate adapter trimmer entails no a priori adapter sequences

Bioinformatics ◽

10.1093/bioinformatics/btab025 ◽

2021 ◽

Author(s):

Ting-Hsuan Wang ◽

Cheng-Ching Huang ◽

Jui-Hung Hung

Keyword(s):

Open Source Software ◽

Large Scale ◽

A Priori ◽

Supplementary Information ◽

Supplementary Data ◽

Comparable Accuracy ◽

Meta Analyses ◽

Next Generation Sequencing Ngs ◽

Adapter Trimming ◽

Generation Sequencing

Abstract Motivation Cross-sample comparisons or large-scale meta-analyses based on next generation sequencing (NGS) involve replicable and universal data preprocessing, including removing adapter fragments in contaminated reads (i.e. adapter trimming). Modern adapter trimmers require users to provide candidate adapter sequences for each sample, which are sometimes unavailable or falsely documented in the repositories (such as GEO or SRA); large-scale meta-analyses are therefore jeopardized by suboptimal adapter trimming. Results Here we introduce a set of fast and accurate adapter detection and trimming algorithms that entail no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading for speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of adapter sequences, can reach accuracy comparable to, and throughput higher than, that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of a large batch of datasets and can be incorporated in sequence analysis pipelines at any scale. Availability and implementation EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS. Supplementary information Supplementary data are available at Bioinformatics online.


TADOSS: computational estimation of tandem domain swap stability

Bioinformatics ◽

10.1093/bioinformatics/bty974 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2507-2508 ◽

Cited By ~ 2

Author(s):

Aleix Lafita ◽

Pengfei Tian ◽

Robert B Best ◽

Alex Bateman

Keyword(s):

Large Scale ◽

Domain Swapping ◽

Coarse Grained ◽

Supplementary Information ◽

Simulation Studies ◽

Computational Tools ◽

Domain Swap ◽

Computational Estimation ◽

High Propensity ◽

The Stability

Abstract Summary Proteins with highly similar tandem domains have shown an increased propensity for misfolding and aggregation. Several molecular explanations have been put forward, such as swapping of adjacent domains, but there is a lack of computational tools to systematically analyze them. We present the TAndem DOmain Swap Stability predictor (TADOSS), a method to computationally estimate the stability of tandem domain-swapped conformations from the structures of single domains, based on previous coarse-grained simulation studies. The tool is able to discriminate domains susceptible to domain swapping and to identify structural regions with high propensity to form hinge loops. TADOSS is a scalable method and suitable for large scale analyses. Availability and implementation Source code and documentation are freely available under an MIT license on GitHub at https://github.com/lafita/tadoss. Supplementary information Supplementary data are available at Bioinformatics online.


GWASpro: a high-performance genome-wide association analysis server

Bioinformatics ◽

10.1093/bioinformatics/bty989 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2512-2514 ◽

Cited By ~ 4

Author(s):

Bongsong Kim ◽

Xinbin Dai ◽

Wenchao Zhang ◽

Zhaohong Zhuang ◽

Darlene L Sanchez ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Linear Mixed Model ◽

Association Studies ◽

Learning Curves ◽

Experimental Designs ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Genome Wide

Abstract Summary We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as found in plant science investigations and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times, can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information Supplementary data are available at Bioinformatics online.
