The coding potential of the human genome: global compositional properties identify with statistical significance a plethora of new potential coding regions

Abstract Eukaryotic genomes have gradually gained noncoding regions over the course of evolution, and the human genome actively transcribes >90% of its noncoding regions [1], suggesting that these regions are critical to the evolved human genome. Yet <1% of them have been functionally characterized [2], leaving most of the human genome in the dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of the human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNA-seq data. LncRNAs are distributed divergently across chromosomes, independently of protein-coding regions. Their transcription rarely initiates at promoters through polymerase II, but mostly at enhancers. Yet conventional enhancer activators (e.g. H3K4me1) account for only a small proportion of lncRNA activation, suggesting that alternative, unknown mechanisms initiate the majority of lncRNAs. Meanwhile, lncRNA self-regulation also contributes notably to lncRNA activation. LncRNAs trans-regulate a broad range of bioprocesses, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall, lncRNAs govern a regime of their own, distinct from that of proteins.

Coding Regions (of the Human Genome)

10.1007/springerreference_34716 ◽

2011 ◽

Keyword(s):

Human Genome ◽

Coding Regions

Predicting Coding Potential from Genome Sequence: Application to Betaherpesviruses Infecting Rats and Mice

Journal of Virology ◽

10.1128/jvi.79.12.7570-7596.2005 ◽

2005 ◽

Vol 79 (12) ◽

pp. 7570-7596 ◽

Cited By ~ 46

Author(s):

Luciano Brocchieri ◽

Thomas N. Kledal ◽

Samuel Karlin ◽

Edward S. Mocarski

Keyword(s):

Genome Annotation ◽

Mrna Splicing ◽

Overlapping Genes ◽

Genome Sequences ◽

Protein Coding ◽

Coding Regions ◽

Translation Signals ◽

Rats And Mice ◽

Coding Potential ◽

Exon Gene

ABSTRACT Prediction of protein-coding regions and other features of primary DNA sequence has greatly contributed to experimental biology. Significant challenges remain in genome annotation methods, including the identification of small or overlapping genes and the assessment of mRNA splicing or unconventional translation signals in expression. We have employed a combined analysis of compositional biases and conservation together with frame-specific G+C representation to reevaluate and annotate the genome sequences of mouse and rat cytomegaloviruses. Our analysis predicts that there are at least 34 protein-coding regions in these genomes that were not apparent in earlier annotation efforts. These include 17 single-exon genes, three new exons of previously identified genes, a newly identified four-exon gene for a lectin-like protein (in rat cytomegalovirus), and 10 probable frameshift extensions of previously annotated genes. This expanded set of candidate genes provides an additional basis for investigation in cytomegalovirus biology and pathogenesis.
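
The abstract above mentions frame-specific G+C representation without giving details. As a minimal sketch of the general idea, and not the authors' actual procedure, the following Python function reports the G+C fraction at each of the three codon positions of an in-frame sequence. The function name and the handling of a trailing partial codon are assumptions made for illustration.

```python
from typing import List


def gc_by_codon_position(orf: str) -> List[float]:
    """Return the G+C fraction at codon positions 1, 2 and 3.

    Illustrative only: the sequence is assumed to start in frame, and a
    trailing partial codon is ignored.
    """
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3  # drop an incomplete final codon
    fractions = []
    for offset in range(3):
        bases = seq[offset:usable:3]  # every third base, starting at this codon position
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


if __name__ == "__main__":
    # Codons: ATG GCC GCG TAA
    print(gc_by_codon_position("ATGGCCGCGTAA"))
```

Comparing the three fractions across the six possible reading frames of a region is one simple way to see whether one frame stands out compositionally.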

Structural-Statistical Properties of DNA Coding Regions

Mathematical Biology and Bioinformatics ◽

10.17537/2015.10.387 ◽

2015 ◽

Vol 10 (2) ◽

pp. 387-397 ◽

Cited By ~ 1

Author(s):

В.А. Кутыркин ◽

V.A. Kutyrkin

Keyword(s):

Human Genome ◽

Dna Sequences ◽

Statistical Approach ◽

Statistical Properties ◽

Statistical Characteristics ◽

Dna Coding ◽

Coding Regions ◽

Unknown Type ◽

Triplet Periodicity ◽

Special Meaning

Structural-statistical characteristics of coding DNA sequences (CDSs) from the human genome are investigated within the framework of the spectral-statistical approach (the 2S-approach). The properties of 3-regularity and latent profile periodicity are among these characteristics. The special meaning and intrinsic existence of these properties are confirmed by studying binary-recoded CDSs. Only one kind of singular recoding, the one that identifies complementary nucleotides, preserves the characteristics of the original CDSs. The use of nonsingular binary recoding supports the statement that the latent triplet periodicity in the CDSs of the human genome belongs to a previously unknown type, called profile periodicity.
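
To make the notions of binary recoding and triplet periodicity more concrete, here is a small Python sketch. It is not the 2S-approach described in the abstract. It only assumes that the recoding identifying complementary nucleotides maps A/T to one symbol and G/C to the other, and it uses a generic Pearson chi-square statistic (a swapped-in, standard technique) to measure how strongly symbol usage depends on position modulo 3. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def recode_weak_strong(seq: str) -> str:
    """Binary recoding that merges complementary bases: A/T -> 0, G/C -> 1."""
    table = {"A": "0", "T": "0", "G": "1", "C": "1"}
    return "".join(table[base] for base in seq.upper() if base in table)


def phase_profile(seq: str, period: int = 3) -> List[Dict[str, float]]:
    """Symbol frequencies at each phase (position modulo `period`)."""
    profile = []
    for phase in range(period):
        counts = Counter(seq[phase::period])
        total = sum(counts.values())
        profile.append({s: c / total for s, c in counts.items()} if total else {})
    return profile


def periodicity_score(seq: str, period: int = 3) -> float:
    """Chi-square statistic of phase-by-symbol counts against independence.

    A score of 0 means symbol usage is identical in every phase; larger
    values indicate stronger phase dependence.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    symbols = sorted(set(seq))
    stat = 0.0
    for phase in range(period):
        row = Counter(seq[phase::period])
        row_total = sum(row.values())
        for symbol in symbols:
            expected = row_total * seq.count(symbol) / n
            if expected > 0:
                stat += (row[symbol] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    binary = recode_weak_strong("ACGACGACG")
    print(binary, phase_profile(binary), periodicity_score(binary))
```

Running the same score on the original four-letter sequence and on its binary recoding gives a rough way to check whether a periodic signal survives the recoding.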

Endogenous Retroviruses and Human Evolution

Comparative and Functional Genomics ◽

10.1002/cfg.216 ◽

2002 ◽

Vol 3 (6) ◽

pp. 494-498 ◽

Cited By ~ 30

Author(s):

Konstantin Khodosevich ◽

Yuri Lebedev ◽

Eugene Sverdlov

Keyword(s):

Human Genome ◽

Human Evolution ◽

Endogenous Retroviruses ◽

Regulatory Sequences ◽

Coding Regions ◽

Regulatory Systems ◽

Human Genes ◽

Functional Consequences ◽

Polyadenylation Signals ◽

Human Specific

Humans share about 99% of their genomic DNA with chimpanzees and bonobos; thus, the differences between these species are unlikely to be in gene content but could be caused by inherited changes in regulatory systems. Endogenous retroviruses (ERVs) comprise ∼5% of the human genome. The LTRs of ERVs contain many regulatory sequences, such as promoters, enhancers, polyadenylation signals and factor-binding sites. Thus, they can influence the expression of nearby human genes. All known human-specific LTRs belong to the HERV-K (human ERV) family, the most active family in the human genome. It is likely that some of these ERVs integrated into regulatory regions of the human genome and therefore had an impact on the expression of adjacent genes, which may consequently have contributed to human evolution. This review discusses possible functional consequences of ERV integration in active coding regions.

A new parameter to study compositional properties of non-coding regions in eukaryotic genomes

Gene ◽

10.1016/j.gene.2006.05.030 ◽

2006 ◽

Vol 385 ◽

pp. 75-82 ◽

Cited By ~ 2

Author(s):

Emanuele Bultrini ◽

Elisabetta Pizzi

Keyword(s):

Coding Regions ◽

Compositional Properties ◽

Eukaryotic Genomes

N6-Methyladenine DNA Modification in Human Genome

10.1101/176958 ◽

2017 ◽

Cited By ~ 1

Author(s):

Chuan-Le Xiao ◽

Song Zhu ◽

Minghui He ◽

De Chen ◽

Qian Zhang ◽

...

Keyword(s):

Human Genome ◽

Genomic Dna ◽

Human Cells ◽

Human Diseases ◽

Dna Modification ◽

Down Regulation ◽

Coding Regions

Summary DNA N6-methyladenine (6mA) modification is the most prevalent DNA modification in prokaryotes, but whether it exists in human cells and whether it plays a role in human diseases remain enigmatic. Here, we showed that 6mA is extensively present in the human genome, and we cataloged 881,240 6mA sites accounting for ∼0.051% of the total adenines. [G/C]AGG[C/T] was the motif most significantly associated with 6mA modification. 6mA sites were enriched in the coding regions and mark actively transcribed genes in human cells. We further found that DNA N6-methyladenine and N6-demethyladenine modification in the human genome were mediated by methyltransferase N6AMT1 and demethylase ALKBH1, respectively. The abundance of 6mA was significantly lower in cancers, accompanied by decreased N6AMT1 and increased ALKBH1 levels, and down-regulation of 6mA modification levels promoted tumorigenesis. Collectively, our results demonstrate that DNA 6mA modification is extensively present in human cells and that the decrease of genomic DNA 6mA promotes human tumorigenesis.
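
The motif [G/C]AGG[C/T] reported above translates directly into a regular-expression character-class pattern. The following Python sketch, which is an illustration and not the authors' pipeline, lists every position at which the motif occurs on the given strand. The function name is hypothetical, and scanning the reverse strand is left out for brevity.

```python
import re
from typing import List

# A zero-width lookahead lets overlapping occurrences be reported,
# e.g. the shared C in "GAGGCAGGC".
MOTIF = re.compile(r"(?=([GC]AGG[CT]))")


def find_motif_sites(seq: str) -> List[int]:
    """Return 0-based start positions of every [G/C]AGG[C/T] match."""
    return [match.start() for match in MOTIF.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_motif_sites("GAGGCAGGC"))
    print(find_motif_sites("ttcaggtaa"))
```

Positions found this way only mark candidate motif occurrences; whether a given site is actually modified has to come from the sequencing data.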

Mutation severity spectrum of rare alleles in the human genome is predictive of disease type

10.1101/835462 ◽

2019 ◽

Author(s):

Jimin Pei ◽

Lisa Kinch ◽

Nick V. Grishin

Keyword(s):

Human Genome ◽

Genetic Disorders ◽

Single Amino Acid ◽

Missense Mutations ◽

Single Nucleotide ◽

Protein Coding ◽

Coding Regions ◽

Structural And Functional Properties ◽

Disease Associations ◽

Disease Associated Genes

Abstract The human genome harbors a variety of genetic variations. Single-nucleotide changes that alter amino acids in protein-coding regions are one of the major causes of human phenotypic variation and diseases. These single-amino acid variations (SAVs) are routinely found in whole genome and exome sequencing. Evaluating the functional impact of such genomic alterations is crucial for diagnosis of genetic disorders. We developed DeepSAV, a deep-learning convolutional neural network to differentiate disease-causing and benign SAVs based on a variety of protein sequence, structural and functional properties. Our method outperforms most stand-alone programs and has predictive power similar to that of some of the best available. We transformed DeepSAV scores of rare SAVs observed in the general population into a mutation severity measure of protein-coding genes. This measure reflects a gene’s tolerance to deleterious missense mutations and serves as a useful tool to study gene-disease associations. Genes implicated in cancer, autism, and viral interaction are found by this measure to be intolerant to mutations, while genes associated with a number of other diseases are scored as tolerant. Among known disease-associated genes, those that are mutation-intolerant are likely to function in development and signal transduction pathways, while those that are mutation-tolerant tend to encode metabolic and mitochondrial proteins.
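
The abstract does not say how per-variant scores are combined into a gene-level measure. Purely as an illustration of the aggregation step, the sketch below groups rare variants by gene and averages their scores. The `Variant` type, the allele-frequency cutoff of 0.001, the score scale, and the use of a plain mean are all assumptions and are not taken from the DeepSAV paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical record for one single-amino-acid variation."""
    gene: str
    score: float             # predicted severity; higher = more damaging (assumed scale)
    allele_frequency: float  # population allele frequency


def gene_severity(variants: Iterable[Variant], max_af: float = 0.001) -> Dict[str, float]:
    """Average the scores of rare variants per gene.

    `max_af` is an assumed rarity cutoff; variants above it are ignored.
    """
    scores_by_gene = defaultdict(list)
    for variant in variants:
        if variant.allele_frequency <= max_af:
            scores_by_gene[variant.gene].append(variant.score)
    return {gene: mean(scores) for gene, scores in scores_by_gene.items()}


if __name__ == "__main__":
    toy = [
        Variant("G1", 0.2, 0.0001),
        Variant("G1", 0.4, 0.0005),
        Variant("G1", 0.9, 0.2),     # common variant, excluded by the cutoff
        Variant("G2", 0.5, 0.0002),
    ]
    print(gene_severity(toy))
```

A real analysis would also need to handle genes with very few observed variants, which a simple mean does not address.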

Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape

Nucleic Acids Research ◽

10.1093/nar/gku1280 ◽

2014 ◽

Vol 43 (4) ◽

pp. e27-e27 ◽

Cited By ~ 92

Author(s):

Aurélien Griffon ◽

Quentin Barbier ◽

Jordi Dalino ◽

Jacques van Helden ◽

Salvatore Spicuglia ◽

...

Keyword(s):

Human Genome ◽

Enrichment Analysis ◽

Search Space ◽

Regulatory Elements ◽

Data Sets ◽

Analysis Tool ◽

Coding Regions ◽

Genome Wide ◽

Public Data ◽

Cancer Genomes

Abstract The large collections of ChIP-seq data rapidly accumulating in public data warehouses provide genome-wide binding site maps for hundreds of transcription factors (TFs). However, the extent of the regulatory occupancy space in the human genome has not yet been fully apprehended by integrating public ChIP-seq data sets and combining them with the ENCODE TF map. To enable genome-wide identification of regulatory elements we have collected, analysed and retained 395 available ChIP-seq data sets merged with ENCODE peaks covering a total of 237 TFs. This enhanced repertoire complements and refines current genome-wide occupancy maps by increasing the human genome regulatory search space by 14% compared to ENCODE alone, and also increases the complexity of the regulatory dictionary. As a direct application we used this unified binding repertoire to annotate variant enhancer loci (VELs) from the H3K4me1 mark in two cancer cell lines (MCF-7, CRC) and observed enrichments of specific TFs involved in key biological functions in cancer development and proliferation. Those enrichments of TFs within VELs provide a direct annotation of non-coding regions detected in cancer genomes. Finally, full access to this catalogue is available online together with the TF enrichment analysis tool (http://tagc.univ-mrs.fr/remap/).
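
A figure such as the 14% increase in search space can be read as a comparison of genomic coverage before and after adding new peaks. The abstract does not state how the search space was measured, so the following Python sketch rests on an assumption: coverage is the number of bases spanned by the union of peak intervals. It merges overlapping half-open intervals on a single chromosome and reports the relative gain. All names and the toy coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Number of bases covered by the union of the intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def relative_increase(baseline: Iterable[Interval], extra: Iterable[Interval]) -> float:
    """Fractional gain in covered bases when `extra` is added to `baseline`."""
    baseline = list(baseline)
    base = covered_bases(baseline)
    if base == 0:
        raise ValueError("baseline covers no bases")
    combined = covered_bases(baseline + list(extra))
    return (combined - base) / base


if __name__ == "__main__":
    baseline_peaks = [(0, 100), (50, 150), (300, 350)]  # toy baseline peak set
    new_peaks = [(140, 160), (400, 418)]                # toy additional peaks
    print(relative_increase(baseline_peaks, new_peaks))
```

For genome-wide data the same computation would be run per chromosome and the covered bases summed before taking the ratio.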
