De novo gene evolution: How do we transition from non-coding to coding?

10.7287/peerj.preprints.3031v2 ◽

2017 ◽

Author(s):

Jorge Ruiz-Orera ◽

José Luis Villanueva-Cañas ◽

William Blevins ◽

M.Mar Albà

Keyword(s):

De Novo ◽

Gene Evolution ◽

Neutral Evolution ◽

Functional Protein ◽

Protein Coding ◽

Coding Sequences ◽

Sequence Composition ◽

Protein Coding Genes ◽

Small Proteins ◽

De Novo Gene

Recent years have witnessed the discovery of protein–coding genes which appear to have evolved de novo from previously non-coding sequences. This has changed the long-standing view that coding sequences can only evolve from other coding sequences. However, there are still many open questions regarding how new protein-coding sequences can arise from non-genic DNA. Two prerequisites for the birth of a new functional protein-coding gene are that the corresponding DNA fragment is transcribed and that it is also translated. Transcription is known to be pervasive in the genome, producing a large number of transcripts that do not correspond to conserved protein-coding genes, and which are usually annotated as long non-coding RNAs (lncRNA). Recently, sequencing of ribosome protected fragments (Ribo-Seq) has provided evidence that many of these transcripts actually translate small proteins. We have used mouse non-synonymous and synonymous variation data to estimate the strength of purifying selection acting on the translated open reading frames (ORFs). Whereas a subset of the lncRNAs are likely to actually be true protein-coding genes (and thus previously misclassified), the bulk of lncRNAs code for proteins which show variation patterns consistent with neutral evolution. We also show that the ORFs that have a more favorable, coding-like, sequence composition are more likely to be translated than other ORFs in lncRNAs. This study provides strong evidence that there is a large and ever-changing reservoir of lowly abundant proteins; some of these peptides may become useful and act as seeds for de novo gene evolution.

Download Full-text

How do we transition from non-coding to coding?

10.7287/peerj.preprints.3031v1 ◽

2017 ◽

Author(s):

Jorge Ruiz-Orera ◽

José Luis Villanueva-Cañas ◽

William Blevins ◽

M.Mar Albà

Keyword(s):

De Novo ◽

Gene Evolution ◽

Purifying Selection ◽

Neutral Evolution ◽

Functional Protein ◽

Protein Coding ◽

Coding Sequences ◽

Sequence Composition ◽

Protein Coding Genes ◽

Small Proteins

Recent years have witnessed the discovery of protein–coding genes which appear to have evolved de novo from previously non-coding sequences. This has changed the long-standing view that coding sequences can only evolve from other coding sequences. However, there are still many open questions regarding how new protein-coding sequences can arise from non-genic DNA. Two prerequisites for the birth of a new functional protein-coding gene are that the corresponding DNA fragment is transcribed and that it is also translated. Transcription is known to be pervasive in the genome, producing a large number of transcripts that do not correspond to conserved protein-coding genes, and which are usually annotated as long non-coding RNAs (lncRNA). Recently, sequencing of ribosome protected fragments (Ribo-Seq) has provided evidence that many of these transcripts actually translate small proteins. We have used mouse non-synonymous and synonymous variation data to estimate the strength of purifying selection acting on the translated open reading frames (ORFs). Whereas a subset of the lncRNAs are likely to actually be true protein-coding genes (and thus previously misclassified), the bulk of lncRNAs code for proteins which show variation patterns consistent with neutral evolution. We also show that the ORFs that have a more favorable, coding-like, sequence composition are more likely to be translated than other ORFs in lncRNAs. This study provides strong evidence that there is a large and ever-changing reservoir of lowly abundant proteins; some of these peptides may become useful and act as seeds for de novo gene evolution.

Download Full-text

Standard codon substitution models overestimate purifying selection for non-stationary data

10.7287/peerj.preprints.2218v1 ◽

2016 ◽

Author(s):

Benjamin D Kaehler ◽

Von Bing Yap ◽

Gavin A Huttley

Keyword(s):

Natural Selection ◽

De Novo ◽

Purifying Selection ◽

Neutral Evolution ◽

Protein Coding ◽

Synonymous Substitutions ◽

New Model ◽

Sequence Composition ◽

Codon Substitution ◽

Substitution Models

Estimation of natural selection on protein-coding sequences is a key comparative genomics approach for de novo prediction of lineage specific adaptations. Selective pressure is measured on a per-gene basis by comparing the rate of non-synonymous substitutions to the rate of neutral evolution, typically assumed to be the rate of synonymous substitutions. All published codon substitution models have been time-reversible and thus assume that sequence composition does not change over time. We previously demonstrated that if time-reversible DNA substitution models are applied blindly in the presence of changing sequence composition, the number of substitutions is systematically biased towards overestimation. We extend these findings to the case of codon substitution models and further demonstrate that the ratio of non-synonymous to synonymous rates of substitution tends to be underestimated over three data sets of insects, mammals, and vertebrates. Our basis for comparison is a non-stationary codon substitution model that allows sequence composition to change. Model selection and model fit results demonstrate that our new model tends to fit the data better. Direct measurement of non-stationarity shows that bias in estimates of natural selection and genetic distance increases with the degree of violation of the stationarity assumption. Additionally, inferences drawn under time-reversible models are systematically affected by compositional divergence. As genomic sequences accumulate at an accelerating rate, the importance of accurate de novo estimation of natural selection increases. Our results establish that our new model provides a more robust perspective on this fundamental quantity.

Download Full-text

De novoemergence of adaptive membrane proteins from thymine-rich intergenic sequences

10.1101/621532 ◽

2019 ◽

Author(s):

Nikolaos Vakirlis ◽

Omer Acar ◽

Brian Hsu ◽

Nelson Castilho Coelho ◽

S. Branden Van Oss ◽

...

Keyword(s):

De Novo ◽

Transmembrane Proteins ◽

Protein Coding ◽

Coding Sequences ◽

Beneficial Effects ◽

Protein Coding Genes ◽

Evolutionary Innovation ◽

Intergenic Sequences ◽

Intergenic Regions ◽

Novel Protein

SummaryRecent evidence demonstrates that novel protein-coding genes can arisede novofrom intergenic loci. This evolutionary innovation is thought to be facilitated by the pervasive translation of intergenic transcripts, which exposes a reservoir of variable polypeptides to natural selection. Do intergenic translation events yield polypeptides with useful biochemical capacities? The answer to this question remains controversial. Here, we systematically characterized howde novoemerging coding sequences impact fitness. In budding yeast, overexpression of these sequences was enriched in beneficial effects, while their disruption was generally inconsequential. We found that beneficial emerging sequences have a strong tendency to encode putative transmembrane proteins, which appears to stem from a cryptic propensity for transmembrane signals throughout thymine-rich intergenic regions of the genome. These findings suggest that novel genes with useful biochemical capacities, such as transmembrane domains, tend to evolvede novowithin intergenic loci that already harbored a blueprint for these capacities.

Download Full-text

From de novo to ‘de nono’: most novel protein coding genes identified with phylostratigraphy represent old genes or recent duplicates

10.1101/287193 ◽

2018 ◽

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Gc Content ◽

Protein Coding ◽

Protein Coding Genes ◽

Gene Sets ◽

De Novo Genes ◽

De Novo Gene ◽

Similarity Searches ◽

Novel Protein

AbstractThe evolution of novel protein-coding genes from noncoding regions of the genome is one of the most compelling evidence for genetic innovations in nature. One popular approach to identify de novo genes is phylostratigraphy, which consists of determining the approximate time of origin (age) of a gene based on its distribution along a species phylogeny. Several studies have revealed significant flaws in determining the age of genes, including de novo genes, using phylostratigraphy alone. However, the rate of false positives in de novo gene surveys, based on phylostratigraphy, remains unknown. Here, I re-analyze the findings from three studies, two of which identified tens to hundreds of rodent-specific de novo genes adopting a phylostratigraphy-centered approach. Most of the putative de novo genes discovered in these investigations are no longer included in recently updated mouse gene sets. Using a combination of synteny information and sequence similarity searches, I show that about 60% of the remaining 381 putative de novo genes share homology with genes from other vertebrates, originated through gene duplication, and/or share no synteny information with non-rodent mammals. These results led to an estimated rate of ∼12 de novo genes per million year in mouse. Contrary to a previous study (Wilson et al. 2017), I found no evidence supporting the preadaptation hypothesis of de novo gene formation. Nearly half of the de novo genes confirmed in this study are within older genes, indicating that co-option of preexisting regulatory regions and a higher GC content may facilitate the origin of novel genes.

Download Full-text

A Depletion of Stop Codons in lincRNA is Owing to Transfer of Selective Constraint from Coding Sequences

Molecular Biology and Evolution ◽

10.1093/molbev/msz299 ◽

2019 ◽

Vol 37 (4) ◽

pp. 1148-1164

Author(s):

Liam Abrahams ◽

Laurence D Hurst

Keyword(s):

De Novo ◽

Stop Codon ◽

Noncoding Rnas ◽

Selective Constraint ◽

Protein Coding ◽

Coding Sequences ◽

Reading Frame ◽

Stop Codons ◽

Protein Coding Genes ◽

Exonic Splice

Abstract Although the constraints on a gene’s sequence are often assumed to reflect the functioning of that gene, here we propose transfer selection, a constraint operating on one class of genes transferred to another, mediated by shared binding factors. We show that such transfer can explain an otherwise paradoxical depletion of stop codons in long intergenic noncoding RNAs (lincRNAs). Serine/arginine-rich proteins direct the splicing machinery by binding exonic splice enhancers (ESEs) in immature mRNA. As coding exons cannot contain stop codons in one reading frame, stop codons should be rare within ESEs. We confirm that the stop codon density (SCD) in ESE motifs is low, even accounting for nucleotide biases. Given that serine/arginine-rich proteins binding ESEs also facilitate lincRNA splicing, a low SCD could transfer to lincRNAs. As predicted, multiexon lincRNA exons are depleted in stop codons, a result not explained by open reading frame (ORF) contamination. Consistent with transfer selection, stop codon depletion in lincRNAs is most acute in exonic regions with the highest ESE density, disappears when ESEs are masked, is consistent with stop codon usage skews in ESEs, and is diminished in both single-exon lincRNAs and introns. Owing to low SCD, the maximum lengths of pseudo-ORFs frequently exceed null expectations. This has implications for ORF annotation and the evolution of de novo protein-coding genes from lincRNAs. We conclude that not all constraints operating on genes need be explained by the functioning of the gene but may instead be transferred owing to shared binding factors.

Download Full-text

Studying the dawn of de novo gene emergence in mice reveals fast integration of new genes into functional networks

10.1101/510214 ◽

2019 ◽

Cited By ~ 3

Author(s):

Chen Xie ◽

Cemalettin Bekpen ◽

Sven Künzel ◽

Maryam Keshavarz ◽

Rebecca Krebs-Wheaton ◽

...

Keyword(s):

De Novo ◽

Expression Patterns ◽

Transcriptional Networks ◽

Protein Coding ◽

Protein Coding Genes ◽

New Genes ◽

De Novo Gene ◽

Intergenic Sequences ◽

Genomic Analyses ◽

New Protein

AbstractThe de novo emergence of new transcripts has been well documented through genomic analyses. However, a functional analysis, especially of very young protein-coding genes, is still largely lacking. Here we focus on three loci that have evolved from previously intergenic sequences in the house mouse (Mus musculus) and are not present in its closest relatives. We have obtained knockouts and analyzed their phenotypes, including a deep transcriptomic analysis, based on a dedicated power analysis. We show that the transcriptional networks are significantly disturbed in the knockouts and that all three genes have effects on phenotypes that are related to their expression patterns. This includes behavioral effects, skeletal differences and the regulation of the reproduction cycle in females. Substitution analysis suggests that all three genes have directly obtained an activity, without new adaptive substitutions. Our findings support the hypothesis that de novo genes can quickly adopt functions without extensive adaptation.Impact statementNew protein-coding genes emerging out of non-coding sequences can become directly functional without signatures of adaptive protein changes

Download Full-text

Standard codon substitution models overestimate purifying selection for non-stationary data

10.7287/peerj.preprints.2218 ◽

2016 ◽

Author(s):

Benjamin D Kaehler ◽

Von Bing Yap ◽

Gavin A Huttley

Keyword(s):

Natural Selection ◽

De Novo ◽

Purifying Selection ◽

Neutral Evolution ◽

Protein Coding ◽

Synonymous Substitutions ◽

New Model ◽

Sequence Composition ◽

Codon Substitution ◽

Substitution Models

Estimation of natural selection on protein-coding sequences is a key comparative genomics approach for de novo prediction of lineage specific adaptations. Selective pressure is measured on a per-gene basis by comparing the rate of non-synonymous substitutions to the rate of neutral evolution, typically assumed to be the rate of synonymous substitutions. All published codon substitution models have been time-reversible and thus assume that sequence composition does not change over time. We previously demonstrated that if time-reversible DNA substitution models are applied blindly in the presence of changing sequence composition, the number of substitutions is systematically biased towards overestimation. We extend these findings to the case of codon substitution models and further demonstrate that the ratio of non-synonymous to synonymous rates of substitution tends to be underestimated over three data sets of insects, mammals, and vertebrates. Our basis for comparison is a non-stationary codon substitution model that allows sequence composition to change. Model selection and model fit results demonstrate that our new model tends to fit the data better. Direct measurement of non-stationarity shows that bias in estimates of natural selection and genetic distance increases with the degree of violation of the stationarity assumption. Additionally, inferences drawn under time-reversible models are systematically affected by compositional divergence. As genomic sequences accumulate at an accelerating rate, the importance of accurate de novo estimation of natural selection increases. Our results establish that our new model provides a more robust perspective on this fundamental quantity.

Download Full-text

From de novo to ‘de nono’: The majority of novel protein coding genes identified with phylostratigraphy are old genes or recent duplicates

Genome Biology and Evolution ◽

10.1093/gbe/evy231 ◽

2018 ◽

Cited By ~ 2

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Protein Coding ◽

Protein Coding Genes ◽

Novel Protein

Download Full-text

Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome

10.1101/847558 ◽

2019 ◽

Author(s):

Ryan Bracewell ◽

Anita Tran ◽

Kamalakar Chatla ◽

Doris Bachtrog

Keyword(s):

X Chromosome ◽

Genome Assembly ◽

De Novo ◽

Pericentromeric Region ◽

Species Group ◽

Chromosome 15 ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Read ◽

Chromosome Level

ABSTRACTThe Drosophila obscura species group is one of the most studied clades of Drosophila and harbors multiple distinct karyotypes. Here we present a de novo genome assembly and annotation of D. bifasciata, a species which represents an important subgroup for which no high-quality chromosome-level genome assembly currently exists. We combined long-read sequencing (Nanopore) and Hi-C scaffolding to achieve a highly contiguous genome assembly approximately 193Mb in size, with repetitive elements constituting 30.1% of the total length. Drosophila bifasciata harbors four large metacentric chromosomes and the small dot, and our assembly contains each chromosome in a single scaffold, including the highly repetitive pericentromere, which were largely composed of Jockey and Gypsy transposable elements. We annotated a total of 12,821 protein-coding genes and comparisons of synteny with D. athabasca orthologs show that the large metacentric pericentromeric regions of multiple chromosomes are conserved between these species. Importantly, Muller A (X chromosome) was found to be metacentric in D. bifasciata and the pericentromeric region appears homologous to the pericentromeric region of the fused Muller A-AD (XL and XR) of pseudoobscura/affinis subgroup species. Our finding suggests a metacentric ancestral X fused to a telocentric Muller D and created the large neo-X (Muller A-AD) chromosome ∼15 MYA. We also confirm the fusion of Muller C and D in D. bifasciata and show that it likely involved a centromere-centromere fusion.

Download Full-text