The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of eukaryotic genomes

Bioinformatics ◽

10.1093/bioinformatics/btv679 ◽

2015 ◽

Vol 32 (6) ◽

pp. 835-842 ◽

Cited By ~ 9

Author(s):

Filippo Utro ◽

Valeria Di Benedetto ◽

Davide F.V. Corona ◽

Raffaele Giancarlo

Keyword(s):

Closed Form ◽

Dna Sequence ◽

Chemical Properties ◽

Supplementary Information ◽

Information Theoretic ◽

Nucleosome Organization ◽

A Genome ◽

Intrinsic Complexity ◽

Mathematical Formulas ◽

Eukaryotic Genomes

Abstract Motivation: Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The first is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions about the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The latter is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only. Results: We help close this important methodological gap between the two models by providing three very simple formulas for the sequence specific one. They are all based on well-known formulas in Computer Science and Bioinformatics, and they give different quantifications of how complex a sequence is. In view of how remarkably well they perform, it is very surprising that measures of sequence complexity have not even been considered as candidates to close this gap. We provide experimental evidence that the intrinsic level of combinatorial organization and information-theoretic content of subsequences within a genome are strongly correlated with the level of DNA encoded nucleosome organization discovered by Kaplan et al. Our results establish an important connection between the intrinsic complexity of subsequences in a genome and the intrinsic, i.e. DNA encoded, nucleosome organization of eukaryotic genomes. It is a first step towards a mathematical characterization of this latter ‘encoding’. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: [email protected].
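As an illustration of what a closed-form, training-free measure of sequence complexity can look like, the sketch below computes the Shannon entropy of the empirical k-mer distribution of a subsequence. The abstract does not name the three formulas, so the choice of measure, the function name `kmer_entropy` and the default `k` are assumptions made here for illustration, not the authors' method.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq: str, k: int = 2) -> float:
    """Shannon entropy, in bits, of the empirical k-mer distribution of seq.

    Hypothetical helper for illustration only; it is not one of the
    formulas from the paper.
    """
    seq = seq.upper()
    # Collect all overlapping k-mers of the sequence.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    total = len(kmers)
    # H = -sum p * log2(p) over the observed k-mers.
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A homopolymer has a single repeated k-mer, a mixed sequence has many.
    print(kmer_entropy("AAAAAAAAAAAA", k=2))
    print(kmer_entropy("ACGTTGCAAGCT", k=2))
```

A low value indicates a repetitive subsequence and a high value a more varied one. Such a score could be computed over sliding windows and compared with a nucleosome occupancy track, which is the kind of correlation the abstract reports.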

MOSGA: Modular Open-Source Genome Annotator

Bioinformatics ◽

10.1093/bioinformatics/btaa1003 ◽

2020 ◽

Author(s):

Roman Martin ◽

Thomas Hackl ◽

Georges Hattab ◽

Matthias G Fischer ◽

Dominik Heider

Keyword(s):

Open Source ◽

Source Code ◽

Supplementary Information ◽

Web Interface ◽

Fully Integrated ◽

Sequencing Technologies ◽

A Genome ◽

Wide Range ◽

User Friendly ◽

Eukaryotic Genomes

Abstract Motivation: The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, the annotation of these assemblies, a crucial step toward unlocking the biology of the organism of interest, has remained a complex challenge that often requires advanced bioinformatics expertise. Results: Here, we present MOSGA (Modular Open-Source Genome Annotator), a genome annotation framework for eukaryotic genomes with a user-friendly web interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable and easily extendible Snakemake backend, and thus can be tailored to a wide range of users and projects. Availability and implementation: We provide MOSGA as a web service at https://mosga.mathematik.uni-marburg.de and as a docker container at registry.gitlab.com/mosga/mosga:latest. Source code can be found at https://gitlab.com/mosga/mosga. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.

Unique mobile elements and scalable gene flow at the prokaryote–eukaryote boundary revealed by circularized Asgard archaea genomes

Nature Microbiology ◽

10.1038/s41564-021-01039-y ◽

2022 ◽

Author(s):

Fabai Wu ◽

Daan R. Speth ◽

Alon Philosof ◽

Antoine Crémière ◽

Aditi Narayanan ◽

...

Keyword(s):

Viral Genome ◽

Scaling Laws ◽

Continuous Process ◽

Mobile Elements ◽

Gene Repertoire ◽

Bacterial Gene ◽

Size Dependent ◽

Sequence Of Events ◽

A Genome ◽

Eukaryotic Genomes

Abstract Eukaryotic genomes are known to have garnered innovations from both archaeal and bacterial domains but the sequence of events that led to the complex gene repertoire of eukaryotes is largely unresolved. Here, through the enrichment of hydrothermal vent microorganisms, we recovered two circularized genomes of Heimdallarchaeum species that belong to an Asgard archaea clade phylogenetically closest to eukaryotes. These genomes reveal diverse mobile elements, including an integrative viral genome that bidirectionally replicates in a circular form and aloposons, transposons that encode the 5,000 amino acid-sized proteins Otus and Ephialtes. Heimdallarchaeal mobile elements have garnered various genes from bacteria and bacteriophages, likely playing a role in shuffling functions across domains. The numbers of archaea- and bacteria-related genes follow strikingly different scaling laws in Asgard archaea, exhibiting a genome size-dependent ratio and a functional division resembling the bacteria- and archaea-derived gene repertoire across eukaryotes. Bacterial gene import has thus likely been a continuous process unaltered by eukaryogenesis and scaled up through genome expansion. Our data further highlight the importance of viewing eukaryogenesis in a pan-Asgard context, which led to the proposal of a conceptual framework, that is, the Heimdall nucleation–decentralized innovation–hierarchical import model that accounts for the emergence of eukaryotic complexity.

DNA methylation marks inter-nucleosome linker regions throughout the human genome

10.7287/peerj.preprints.27 ◽

2013 ◽

Author(s):

Benjamin P. Berman ◽

Yaping Liu ◽

Theresa K. Kelly

Keyword(s):

Dna Methylation ◽

Human Genome ◽

Ctcf Binding ◽

Gene Promoters ◽

Cpg Dinucleotides ◽

Sequencing Technologies ◽

Nucleosome Organization ◽

Genome Wide ◽

A Genome ◽

Sequencing Studies

Background: Nucleosome organization and DNA methylation are two mechanisms that are important for proper control of mammalian transcription, as well as epigenetic dysregulation associated with cancer. Whole-genome DNA methylation sequencing studies have found that methylation levels in the human genome show periodicities of approximately 190 bp, suggesting a genome-wide relationship between the two marks. A recent report (Chodavarapu et al., 2010) attributed this to higher methylation levels of DNA within nucleosomes. Here, we analyzed a number of published datasets and found a more compelling alternative explanation, namely that methylation levels are highest in linker regions between nucleosomes. Results: Reanalyzing the data from Chodavarapu et al. (2010), we found that nucleosome-associated methylation could be strongly confounded by known sequence-related biases of next-generation sequencing technologies. By accounting for these biases and using an unrelated nucleosome profiling technology, NOMe-seq, we found that genome-wide methylation was actually highest within linker regions occurring between nucleosomes in multi-nucleosome arrays. This effect was consistent among several methylation datasets generated independently using two unrelated methylation assays. Linker-associated methylation was most prominent within long Partially Methylated Domains (PMDs) and the positioned nucleosomes that flank CTCF binding sites. CTCF-adjacent nucleosomes retained the correct positioning in regions completely devoid of CpG dinucleotides, suggesting that DNA methylation is not required for proper nucleosome positioning. Conclusions: The biological mechanisms responsible for DNA methylation patterns outside of gene promoters remain poorly understood. We identified a significant genome-wide relationship between nucleosome organization and DNA methylation, which can be used to more accurately analyze and understand the epigenetic changes that accompany cancer and other diseases.

The use of double fluorescence in situ hybridization to physically map the positions of 5S rDNA genes in relation to the chromosomal location of 18S–5.8S–26S rDNA and a C genome specific DNA sequence in the genus Avena

Genome ◽

10.1139/g96-068 ◽

1996 ◽

Vol 39 (3) ◽

pp. 535-542 ◽

Cited By ~ 63

Author(s):

Concha Linares ◽

Juan González ◽

Esther Ferrer ◽

Araceli Fominaya

Keyword(s):

In Situ Hybridization ◽

Dna Sequence ◽

Diploid Species ◽

5S Rdna ◽

Rdna Genes ◽

26S Rdna ◽

A Genome ◽

Rdna Loci ◽

C Genome

A physical map of the locations of the 5S rDNA genes and their relative positions with respect to 18S–5.8S–26S rDNA genes and a C genome specific repetitive DNA sequence was produced for the chromosomes of diploid, tetraploid, and hexaploid oat species using in situ hybridization. The A genome diploid species showed two pairs of rDNA loci and two pairs of 5S loci located on both arms of one pair of satellited chromosomes. The C genome diploid species showed two major pairs and one minor pair of rDNA loci. One pair of subtelocentric chromosomes carried rDNA and 5S loci physically separated on the long arm. The tetraploid species (AACC genomes) arising from these diploid ancestors showed two pairs of rDNA loci and three pairs of 5S loci. Two pairs of rDNA loci and two pairs of 5S loci were arranged as in the A genome diploid species. The third pair of 5S loci was located on one pair of A–C translocated chromosomes using simultaneous in situ hybridization with 5S rDNA genes and a C genome specific repetitive DNA sequence. The hexaploid species (AACCDD genomes) showed three pairs of rDNA loci and six pairs of 5S loci. One pair of 5S loci was located on each of two pairs of C–A/D translocated chromosomes. Comparative studies of the physical arrangement of rDNA and 5S loci in polyploid oats and the putative A and C genome progenitor species suggest that A genome diploid species could be the donor of both A and D genomes of polyploid oats. Key words: oats, 5S rDNA genes, 18S–5.8S–26S rDNA genes, C genome specific repetitive DNA sequence, in situ hybridization, genome evolution.

A Closed-Form Expression for Outage Secrecy Capacity in Wireless Information-Theoretic Security

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering - Security in Emerging Wireless Communication and Networking Systems ◽

10.1007/978-3-642-11526-4_1 ◽

2010 ◽

pp. 3-12 ◽

Cited By ~ 6

Author(s):

Theofilos Chrysikos ◽

Tasos Dagiuklas ◽

Stavros Kotsopoulos

Keyword(s):

Closed Form ◽

Closed Form Expression ◽

Secrecy Capacity ◽

Form Expression ◽

Information Theoretic ◽

Information Theoretic Security

Deep learning on chaos game representation for proteins

Bioinformatics ◽

10.1093/bioinformatics/btz493 ◽

2019 ◽

Vol 36 (1) ◽

pp. 272-279 ◽

Cited By ~ 5

Author(s):

Hannah F Löchel ◽

Dominic Eger ◽

Theodor Sperlea ◽

Dominik Heider

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Chemical Properties ◽

Protein Sequences ◽

Machine Learning Techniques ◽

Supplementary Information ◽

Support Vector ◽

Chaos Game Representation ◽

Chaos Game ◽

Game Representation

Abstract Motivation: Classification of protein sequences is a major task in bioinformatics and has many applications. Different machine learning methods exist and are applied to these problems, such as support vector machines (SVM), random forests (RF) and neural networks (NN). All of these methods have in common that protein sequences have to be made machine-readable and comparable in the first step, for which different encodings exist. These encodings are typically based on physical or chemical properties of the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image recognition, we used frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images. In this study, we compare the performance of SVMs, RFs and DNNs, trained on FCGR encoded protein sequences. While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, we modified it to work also for protein sequences, resulting in n-flakes representation, an image with several icosagons. Results: We could show that all applied machine learning techniques (RF, SVM and DNN) show promising results compared to the state-of-the-art methods on our benchmark datasets, with DNNs outperforming the other methods, and that FCGR is a promising new encoding method for protein sequences. Availability and implementation: https://cran.r-project.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
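To make the encoding idea concrete, the sketch below builds a frequency matrix chaos game representation for a nucleotide sequence, i.e. the original four-corner CGR that the abstract says was used mainly for genome sequences. It is not the protein n-flakes variant introduced in the paper. The corner assignment in `CORNERS`, the function name `fcgr` and the grid orientation are assumptions chosen for illustration.

```python
# Assumed corner layout of the unit square; other layouts are equally valid.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq: str, k: int = 3) -> list[list[int]]:
    """Return a 2^k x 2^k matrix of k-mer counts laid out by the chaos game.

    Illustrative sketch for DNA only; k-mers containing symbols other
    than A, C, G, T are skipped.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue
        # Chaos game: start at the centre and move halfway to each
        # base's corner in turn; the end point identifies the k-mer.
        x = y = 0.5
        for base in kmer:
            cx, cy = CORNERS[base]
            x = (x + cx) / 2
            y = (y + cy) / 2
        grid[int(y * size)][int(x * size)] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGTTTGACA", k=2):
        print(row)
```

Each k-mer lands in its own cell, so the matrix is a fixed-size image of k-mer frequencies that can be passed to an image classifier regardless of sequence length.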

Accurate and efficient gene function prediction using a multi-bacterial network

Bioinformatics ◽

10.1093/bioinformatics/btaa885 ◽

2020 ◽

Author(s):

Jeffrey N Law ◽

Shiv D Kale ◽

T M Murali

Keyword(s):

Gene Function ◽

Bacterial Species ◽

Heterogeneous Data ◽

Function Prediction ◽

Label Propagation ◽

Supplementary Information ◽

Gene Function Prediction ◽

Functional Annotations ◽

A Genome ◽

Multiple Species

Abstract Motivation: Nearly 40% of the genes in sequenced genomes have no experimentally or computationally derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally based functional annotations and systematically transfer them to newly sequenced organisms on a genome-wide scale. However, the large sizes of such networks pose a challenge for the scalability of current methods. Results: We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under 3 min for 200 bacterial species. Availability and implementation: An implementation of our framework and all data used in this research are available at https://github.com/Murali-group/multi-species-GOA-prediction. Supplementary information: Supplementary data are available at Bioinformatics online.
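For readers unfamiliar with label propagation on a gene network, the sketch below shows a generic version of the idea: genes with known annotations keep fixed scores of 1 (positive) or 0 (negative), and every other gene repeatedly takes the weighted average of its neighbours' scores. This is a textbook-style illustration and not FastSinkSource itself; the function name `propagate_labels`, the dictionary-based graph format and the fixed iteration count are assumptions, and the bounding technique that gives the reported speed-up is not reproduced.

```python
def propagate_labels(adj, positives, negatives, iters=100):
    """Generic label propagation with fixed labelled nodes (illustrative).

    adj maps each node to a dict {neighbour: edge_weight}.
    Returns a dict of scores in [0, 1] for every node.
    """
    fixed = set(positives) | set(negatives)
    scores = {node: 0.0 for node in adj}
    for node in positives:
        scores[node] = 1.0
    for _ in range(iters):
        updated = dict(scores)
        for node, nbrs in adj.items():
            if node in fixed or not nbrs:
                continue
            total_weight = sum(nbrs.values())
            # Unlabelled node: weighted mean of current neighbour scores.
            updated[node] = sum(w * scores[m] for m, w in nbrs.items()) / total_weight
        scores = updated
    return scores


if __name__ == "__main__":
    # Toy chain: pos - g1 - g2 - neg, all edge weights equal to 1.
    network = {
        "pos": {"g1": 1.0},
        "g1": {"pos": 1.0, "g2": 1.0},
        "g2": {"g1": 1.0, "neg": 1.0},
        "neg": {"g2": 1.0},
    }
    print(propagate_labels(network, positives=["pos"], negatives=["neg"]))
```

On the toy chain, the gene closer to the positive example ends with the higher score, which is the behaviour used to rank candidate genes for a function.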

ADDO: a comprehensive toolkit to detect, classify and visualize additive and non-additive quantitative trait loci

Bioinformatics ◽

10.1093/bioinformatics/btz786 ◽

2019 ◽

Vol 36 (5) ◽

pp. 1517-1521

Author(s):

Leilei Cui ◽

Bin Yang ◽

Nikolas Pontikos ◽

Richard Mott ◽

Lusheng Huang

Keyword(s):

Quantitative Trait Loci ◽

Quantitative Trait ◽

Association Studies ◽

Genetic Effects ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Additive Effects ◽

A Genome ◽

Trait Loci ◽

Additive Genetic Effects

Abstract Motivation: During the past decade, genome-wide association studies (GWAS) have been used to map quantitative trait loci (QTLs) underlying complex traits. However, most GWAS focus on additive genetic effects while ignoring non-additive effects, on the assumption that most QTL act additively. Consequently, QTLs driven by dominance and other non-additive effects could be overlooked. Results: We developed ADDO, a highly efficient tool to detect, classify and visualize QTLs with additive and non-additive effects. ADDO implements a mixed-model transformation to control for population structure and unequal relatedness that accounts for both additive and dominant genetic covariance among individuals, and decomposes single-nucleotide polymorphism effects as either additive, partial dominant, dominant or over-dominant. A matrix multiplication approach is used to accelerate the computation: a genome scan on 13 million markers from 900 individuals takes about 5 h with 10 CPUs. Analysis of simulated data confirms ADDO’s performance on traits with different additive and dominance genetic variance components. We showed two real examples in outbred rat where ADDO identified significant dominant QTL that were not detectable by an additive model. ADDO provides a systematic pipeline to characterize additive and non-additive QTL in whole genome sequence data, which complements current mainstream GWAS software for additive genetic effects. Availability and implementation: ADDO is customizable and convenient to install and provides extensive analytics and visualizations. The package is freely available online at https://github.com/LeileiCui/ADDO. Supplementary information: Supplementary data are available at Bioinformatics online.
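The four effect classes named in the abstract can be illustrated with the classical quantitative-genetics decomposition of genotype means into an additive effect a and a dominance deviation d. The sketch below is an assumption-laden illustration and not ADDO's implementation: the function name `classify_effect`, the use of raw genotype means without a mixed model, and the |d/a| cut-offs of 0.2, 0.8 and 1.2 are all choices made here for the example.

```python
def classify_effect(mean_aa: float, mean_ab: float, mean_bb: float):
    """Classify a locus from the trait means of its three genotypes.

    Returns (a, d, label), where a is half the difference between the
    homozygote means and d is the heterozygote's deviation from their
    midpoint. Thresholds are illustrative assumptions.
    """
    a = (mean_bb - mean_aa) / 2.0
    d = mean_ab - (mean_aa + mean_bb) / 2.0
    if a == 0:
        # No homozygote difference: any heterozygote shift lies outside
        # the homozygote range.
        return a, d, "no effect" if d == 0 else "over-dominant"
    ratio = abs(d / a)
    if ratio < 0.2:
        label = "additive"
    elif ratio < 0.8:
        label = "partial dominant"
    elif ratio <= 1.2:
        label = "dominant"
    else:
        label = "over-dominant"
    return a, d, label


if __name__ == "__main__":
    print(classify_effect(10.0, 12.0, 14.0))
    print(classify_effect(10.0, 14.0, 14.0))
    print(classify_effect(10.0, 16.0, 14.0))
```

A purely additive scan tests only a, so a locus whose signal is carried mostly by d can be missed, which is the gap the abstract describes.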

TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract Motivation: Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results: To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation: https://github.com/ablab/TandemTools. Supplementary information: Supplementary data are available at Bioinformatics online.

Marsupial chromosomics: bridging the gap between genomes and chromosomes

Reproduction Fertility and Development ◽

10.1071/rd18201 ◽

2019 ◽

Vol 31 (7) ◽

pp. 1189 ◽

Cited By ~ 1

Author(s):

Janine E. Deakin ◽

Sally Potter

Keyword(s):

Dna Sequence ◽

Genome Assembly ◽

Genome Architecture ◽

Sequence Information ◽

Full Potential ◽

Tasmanian Devil ◽

Sequencing Technology ◽

A Genome ◽

Devil Facial Tumour Disease ◽

Chromosome Level

Marsupials have unique features that make them particularly interesting to study, and sequencing of marsupial genomes is helping to understand their evolution. A decade ago, it was a huge feat to sequence the first marsupial genome. Now, the advances in sequencing technology have made the sequencing of many more marsupial genomes possible. However, the DNA sequence is only one component of the structures it is packaged into: chromosomes. Knowing the arrangement of the DNA sequence on each chromosome is essential for a genome assembly to be used to its full potential. The importance of combining sequence information with cytogenetics has previously been demonstrated for rapidly evolving regions of the genome, such as the sex chromosomes, as well as for reconstructing the ancestral marsupial karyotype and understanding the chromosome rearrangements involved in the Tasmanian devil facial tumour disease. Despite the recent advances in sequencing technology assisting in genome assembly, physical anchoring of the sequence to chromosomes is required to achieve a chromosome-level assembly. Once chromosome-level assemblies are achieved for more marsupials, we will be able to investigate changes in the packaging and interactions between chromosomes to gain an understanding of the role genome architecture has played during marsupial evolution.