From de novo to ‘de nono’: most novel protein coding genes identified with phylostratigraphy represent old genes or recent duplicates


10.1101/287193 ◽

2018 ◽

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Gc Content ◽

Protein Coding ◽

Protein Coding Genes ◽

Gene Sets ◽

De Novo Genes ◽

De Novo Gene ◽

Similarity Searches ◽

Novel Protein

Abstract: The evolution of novel protein-coding genes from noncoding regions of the genome provides some of the most compelling evidence for genetic innovation in nature. One popular approach to identify de novo genes is phylostratigraphy, which consists of determining the approximate time of origin (age) of a gene based on its distribution along a species phylogeny. Several studies have revealed significant flaws in determining the age of genes, including de novo genes, using phylostratigraphy alone. However, the rate of false positives in de novo gene surveys based on phylostratigraphy remains unknown. Here, I re-analyze the findings from three studies, two of which identified tens to hundreds of rodent-specific de novo genes using a phylostratigraphy-centered approach. Most of the putative de novo genes discovered in these investigations are no longer included in recently updated mouse gene sets. Using a combination of synteny information and sequence similarity searches, I show that about 60% of the remaining 381 putative de novo genes share homology with genes from other vertebrates, originated through gene duplication, and/or share no synteny information with non-rodent mammals. These results led to an estimated rate of ∼12 de novo genes per million years in mouse. Contrary to a previous study (Wilson et al. 2017), I found no evidence supporting the preadaptation hypothesis of de novo gene formation. Nearly half of the de novo genes confirmed in this study are within older genes, indicating that co-option of preexisting regulatory regions and a higher GC content may facilitate the origin of novel genes.
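The age-assignment idea described in the abstract above can be illustrated with a minimal sketch. This is not the pipeline used in the study: the function name `assign_phylostratum`, the `STRATA` table, and the taxon names are all hypothetical, and the set of taxa with detectable homologs is assumed to come from an external similarity search.

```python
def assign_phylostratum(strata, taxa_with_homologs):
    """Return the oldest stratum in which a homolog of the gene is detected.

    strata: list of (stratum_name, taxa) pairs ordered from the focal
            lineage (youngest) outward to the deepest clade (oldest);
            each taxa set holds the species that join the clade at that level.
    taxa_with_homologs: set of species in which a similarity search
            (assumed to be run elsewhere) found a homolog.
    """
    oldest = strata[0][0]  # the gene exists in the focal species by definition
    for name, taxa in strata:
        if taxa & taxa_with_homologs:
            oldest = name  # a hit in a deeper stratum pushes the age back
    return oldest


# Hypothetical strata for a mouse-centered analysis (illustration only).
STRATA = [
    ("Mus", {"mouse"}),
    ("Rodentia", {"rat"}),
    ("Mammalia", {"human", "dog"}),
    ("Vertebrata", {"chicken", "zebrafish"}),
]

print(assign_phylostratum(STRATA, {"mouse"}))               # candidate young gene
print(assign_phylostratum(STRATA, {"mouse", "zebrafish"}))  # old gene, patchy hits
```

In this sketch a gene is called young whenever no hit is found outside the focal lineage, so any homolog missed by the similarity search makes an old gene or a recent duplicate look like a de novo candidate. That is the kind of false positive the abstract sets out to quantify.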


New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2014.0332 ◽

2015 ◽

Vol 370 (1678) ◽

pp. 20140332 ◽

Cited By ~ 72

Author(s):

Aoife McLysaght ◽

Daniele Guerzoni

Keyword(s):

De Novo ◽

Complex Structure ◽

Protein Coding ◽

Functional Roles ◽

Protein Coding Genes ◽

Evolutionary Innovation ◽

New Genes ◽

De Novo Genes ◽

Novel Protein

The origin of novel protein-coding genes de novo was once considered so improbable as to be impossible. In less than a decade, and especially in the last five years, this view has been overturned by extensive evidence from diverse eukaryotic lineages. There is now evidence that this mechanism has contributed a significant number of genes to genomes of organisms as diverse as Saccharomyces, Drosophila, Plasmodium, Arabidopsis and human. From simple beginnings, these genes have in some instances acquired complex structure, regulated expression and important functional roles. New genes are often thought of as dispensable late additions; however, some recent de novo genes in human can play a role in disease. Rather than an extremely rare occurrence, it is now evident that there is a relatively constant trickle of proto-genes released into the testing ground of natural selection. It is currently unknown whether de novo genes arise primarily through an ‘RNA-first’ or ‘ORF-first’ pathway. Either way, evolutionary tinkering with this pool of genetic potential may have been a significant player in the origins of lineage-specific traits and adaptations.


From de novo to ‘de nono’: The majority of novel protein coding genes identified with phylostratigraphy are old genes or recent duplicates

Genome Biology and Evolution ◽

10.1093/gbe/evy231 ◽

2018 ◽

Cited By ~ 2

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Protein Coding ◽

Protein Coding Genes ◽

Novel Protein


Draft genome assembly data of Anoxybacillus sp. strain MB8 isolated from Tattapani hot springs, India

10.1101/2021.06.09.447659 ◽

2021 ◽

Author(s):

Vishnu Prasoodanan P K ◽

Shruti S. Menon ◽

Rituja Saxena ◽

Prashant Waiker ◽

Vineet K Sharma

Keyword(s):

Hot Springs ◽

De Novo ◽

Draft Genome ◽

Gc Content ◽

Central India ◽

Glycoside Hydrolases ◽

Rrna Gene ◽

Aerobic Bacterium ◽

Protein Coding ◽

Protein Coding Genes

Discovery of novel thermophiles has shown promising applications in the field of biotechnology. Due to their thermal stability, they can survive harsh industrial processes, which makes them important to characterize and study. Members of Anoxybacillus are alkaline-tolerant thermophiles and have been extensively isolated from manure, dairy processing plants, and geothermal hot springs. This article reports the assembled data of an aerobic bacterium, Anoxybacillus sp. strain MB8, isolated from the Tattapani hot springs in Central India, whose 16S rRNA gene shares an identity of 97% (99% coverage) with Anoxybacillus kamchatkensis strain G10. The de novo assembled and annotated genome of Anoxybacillus sp. strain MB8 comprises 2,898,780 bp (in 190 contigs) with a GC content of 41.8% and includes 2,976 protein-coding genes, 1 rRNA operon, 73 tRNAs, 1 tmRNA and 10 CRISPR arrays. The predicted protein-coding genes have been classified into 21 eggNOG categories. The KEGG Automated Annotation Server (KAAS) analysis indicated the presence of an assimilatory sulfate reduction pathway, a nitrate-reducing pathway, and genes for glycoside hydrolases (GHs) and glycoside transferases (GTs). GHs and GTs have widespread applications in the baking and food industry for bread manufacturing, and in the paper, detergent and cosmetic industries. Hence, Anoxybacillus sp. strain MB8 holds the potential to be screened and characterized for such commercially relevant enzymes.
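For readers unfamiliar with the GC content statistic quoted above, the following is a minimal sketch of how such a figure can be computed from a set of contigs. The function names `gc_content` and `assembly_gc` are hypothetical, and the choice to ignore ambiguous bases such as N is an assumption, not something stated in the article.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in one sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def assembly_gc(contigs):
    """GC fraction pooled over all contigs, so long contigs weigh more than short ones."""
    return gc_content("".join(contigs))


# Toy contigs for illustration; a real assembly would be read from a FASTA file.
print(assembly_gc(["GGCC", "AATT"]))
```

Pooling bases across contigs, rather than averaging per-contig values, keeps short contigs from skewing the genome-wide figure.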


Draft Genome of the Macadamia Husk Spot Pathogen, Pseudocercospora macadamiae

Phytopathology ◽

10.1094/phyto-12-19-0460-a ◽

2020 ◽

Vol 110 (9) ◽

pp. 1503-1506

Author(s):

Olufemi A. Akinsanmi ◽

Lilia C. Carvalhais

Keyword(s):

Plant Disease Resistance ◽

Plant Disease ◽

De Novo ◽

Draft Genome ◽

Gc Content ◽

Disease Development ◽

Closely Related Species ◽

Protein Coding ◽

Protein Coding Genes ◽

The Family

Pseudocercospora macadamiae causes husk spot in macadamia in Australia. A lack of genomic resources for this pathogen has limited knowledge of the mechanism of disease development, its spread, and its role in fruit abscission. To address this gap, we sequenced the genome of P. macadamiae. The sequence was de novo assembled into a draft genome of 40 Mb, which is comparable to closely related species in the family Mycosphaerellaceae. The draft genome comprises 212 scaffolds, of which 99 scaffolds are over 50 kb. The genome has a 49% GC content and is predicted to contain 15,430 protein-coding genes. This draft genome sequence is the first for P. macadamiae and represents a valuable resource for understanding genome evolution and plant disease resistance.


De novo emergence of adaptive membrane proteins from thymine-rich intergenic sequences

10.1101/621532 ◽

2019 ◽

Author(s):

Nikolaos Vakirlis ◽

Omer Acar ◽

Brian Hsu ◽

Nelson Castilho Coelho ◽

S. Branden Van Oss ◽

...

Keyword(s):

De Novo ◽

Transmembrane Proteins ◽

Protein Coding ◽

Coding Sequences ◽

Beneficial Effects ◽

Protein Coding Genes ◽

Evolutionary Innovation ◽

Intergenic Sequences ◽

Intergenic Regions ◽

Novel Protein

Summary: Recent evidence demonstrates that novel protein-coding genes can arise de novo from intergenic loci. This evolutionary innovation is thought to be facilitated by the pervasive translation of intergenic transcripts, which exposes a reservoir of variable polypeptides to natural selection. Do intergenic translation events yield polypeptides with useful biochemical capacities? The answer to this question remains controversial. Here, we systematically characterized how de novo emerging coding sequences impact fitness. In budding yeast, overexpression of these sequences was enriched in beneficial effects, while their disruption was generally inconsequential. We found that beneficial emerging sequences have a strong tendency to encode putative transmembrane proteins, which appears to stem from a cryptic propensity for transmembrane signals throughout thymine-rich intergenic regions of the genome. These findings suggest that novel genes with useful biochemical capacities, such as transmembrane domains, tend to evolve de novo within intergenic loci that already harbored a blueprint for these capacities.


De novo gene evolution: How do we transition from non-coding to coding?

10.7287/peerj.preprints.3031 ◽

2017 ◽

Author(s):

Jorge Ruiz-Orera ◽

José Luis Villanueva-Cañas ◽

William Blevins ◽

M.Mar Albà

Keyword(s):

De Novo ◽

Gene Evolution ◽

Neutral Evolution ◽

Functional Protein ◽

Protein Coding ◽

Coding Sequences ◽

Sequence Composition ◽

Protein Coding Genes ◽

Small Proteins ◽

De Novo Gene

Recent years have witnessed the discovery of protein-coding genes which appear to have evolved de novo from previously non-coding sequences. This has changed the long-standing view that coding sequences can only evolve from other coding sequences. However, there are still many open questions regarding how new protein-coding sequences can arise from non-genic DNA. Two prerequisites for the birth of a new functional protein-coding gene are that the corresponding DNA fragment is transcribed and that it is also translated. Transcription is known to be pervasive in the genome, producing a large number of transcripts that do not correspond to conserved protein-coding genes, and which are usually annotated as long non-coding RNAs (lncRNAs). Recently, sequencing of ribosome-protected fragments (Ribo-Seq) has provided evidence that many of these transcripts are translated into small proteins. We have used mouse non-synonymous and synonymous variation data to estimate the strength of purifying selection acting on the translated open reading frames (ORFs). Whereas a subset of the lncRNAs are likely to be true protein-coding genes (and thus previously misclassified), the bulk of lncRNAs code for proteins which show variation patterns consistent with neutral evolution. We also show that the ORFs that have a more favorable, coding-like sequence composition are more likely to be translated than other ORFs in lncRNAs. This study provides strong evidence that there is a large and ever-changing reservoir of low-abundance proteins; some of these peptides may become useful and act as seeds for de novo gene evolution.


Studying the dawn of de novo gene emergence in mice reveals fast integration of new genes into functional networks

10.1101/510214 ◽

2019 ◽

Cited By ~ 3

Author(s):

Chen Xie ◽

Cemalettin Bekpen ◽

Sven Künzel ◽

Maryam Keshavarz ◽

Rebecca Krebs-Wheaton ◽

...

Keyword(s):

De Novo ◽

Expression Patterns ◽

Transcriptional Networks ◽

Protein Coding ◽

Protein Coding Genes ◽

New Genes ◽

De Novo Gene ◽

Intergenic Sequences ◽

Genomic Analyses ◽

New Protein

Abstract: The de novo emergence of new transcripts has been well documented through genomic analyses. However, a functional analysis, especially of very young protein-coding genes, is still largely lacking. Here we focus on three loci that have evolved from previously intergenic sequences in the house mouse (Mus musculus) and are not present in its closest relatives. We have obtained knockouts and analyzed their phenotypes, including a deep transcriptomic analysis, based on a dedicated power analysis. We show that the transcriptional networks are significantly disturbed in the knockouts and that all three genes have effects on phenotypes that are related to their expression patterns. This includes behavioral effects, skeletal differences and the regulation of the reproduction cycle in females. Substitution analysis suggests that all three genes have directly obtained an activity, without new adaptive substitutions. Our findings support the hypothesis that de novo genes can quickly adopt functions without extensive adaptation. Impact statement: New protein-coding genes emerging out of non-coding sequences can become directly functional without signatures of adaptive protein changes.


De novo gene evolution: How do we transition from non-coding to coding?

10.7287/peerj.preprints.3031v2 ◽

2017 ◽

Author(s):

Jorge Ruiz-Orera ◽

José Luis Villanueva-Cañas ◽

William Blevins ◽

M.Mar Albà

Keyword(s):

De Novo ◽

Gene Evolution ◽

Neutral Evolution ◽

Functional Protein ◽

Protein Coding ◽

Coding Sequences ◽

Sequence Composition ◽

Protein Coding Genes ◽

Small Proteins ◽

De Novo Gene


Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome

10.1101/847558 ◽

2019 ◽

Author(s):

Ryan Bracewell ◽

Anita Tran ◽

Kamalakar Chatla ◽

Doris Bachtrog

Keyword(s):

X Chromosome ◽

Genome Assembly ◽

De Novo ◽

Pericentromeric Region ◽

Species Group ◽

Chromosome 15 ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Read ◽

Chromosome Level

Abstract: The Drosophila obscura species group is one of the most studied clades of Drosophila and harbors multiple distinct karyotypes. Here we present a de novo genome assembly and annotation of D. bifasciata, a species which represents an important subgroup for which no high-quality chromosome-level genome assembly currently exists. We combined long-read sequencing (Nanopore) and Hi-C scaffolding to achieve a highly contiguous genome assembly approximately 193 Mb in size, with repetitive elements constituting 30.1% of the total length. Drosophila bifasciata harbors four large metacentric chromosomes and the small dot, and our assembly contains each chromosome in a single scaffold, including the highly repetitive pericentromeres, which are largely composed of Jockey and Gypsy transposable elements. We annotated a total of 12,821 protein-coding genes, and comparisons of synteny with D. athabasca orthologs show that the large metacentric pericentromeric regions of multiple chromosomes are conserved between these species. Importantly, Muller A (X chromosome) was found to be metacentric in D. bifasciata, and the pericentromeric region appears homologous to the pericentromeric region of the fused Muller A-AD (XL and XR) of pseudoobscura/affinis subgroup species. Our findings suggest that a metacentric ancestral X fused to a telocentric Muller D, creating the large neo-X (Muller A-AD) chromosome ∼15 MYA. We also confirm the fusion of Muller C and D in D. bifasciata and show that it likely involved a centromere-centromere fusion.


Integrating healthcare and research genetic data empowers the discovery of 28 novel developmental disorders

10.1101/797787 ◽

2019 ◽

Cited By ~ 14

Author(s):

Joanna Kaplanis ◽

Kaitlin E. Samocha ◽

Laurens Wiel ◽

Zhancheng Zhang ◽

Kevin J. Arvai ◽

...

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Genetic Data ◽

Statistical Test ◽

Integrated Healthcare ◽

Protein Coding ◽

Protein Coding Genes ◽

Clinical Diagnostic ◽

Simulation Based

Summary: De novo mutations (DNMs) in protein-coding genes are a well-established cause of developmental disorders (DD). However, known DD-associated genes only account for a minority of the observed excess of such DNMs. To identify novel DD-associated genes, we integrated healthcare and research exome sequences on 31,058 DD parent-offspring trios, and developed a simulation-based statistical test to identify gene-specific enrichments of DNMs. We identified 285 significantly DD-associated genes, including 28 not previously robustly associated with DDs. Despite detecting more DD-associated genes than in any previous study, much of the excess of DNMs of protein-coding genes remains unaccounted for. Modelling suggests that over 1,000 novel DD-associated genes await discovery, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of dominant DDs.
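The general idea of a simulation-based enrichment test can be sketched as follows. This is only an illustration under simplifying assumptions and should not be read as the test developed in the study: the Poisson null model, the function names, and the per-trio rate of 1e-4 are all hypothetical. Only the cohort size of 31,058 trios is taken from the summary above.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method (suits small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def enrichment_p_value(observed, rate_per_trio, n_trios, n_sim=10000, seed=0):
    """Empirical one-sided p-value for seeing at least `observed` DNMs in one gene.

    rate_per_trio is an assumed expected number of DNMs in the gene per trio
    under the null; simulated cohort counts are drawn from a Poisson with
    mean rate_per_trio * n_trios. The +1 terms keep the estimate above zero.
    """
    rng = random.Random(seed)
    expected = rate_per_trio * n_trios
    hits = sum(sample_poisson(expected, rng) >= observed for _ in range(n_sim))
    return (hits + 1) / (n_sim + 1)


# Hypothetical gene: about 3.1 DNMs expected by chance across 31,058 trios.
print(enrichment_p_value(observed=12, rate_per_trio=1e-4, n_trios=31058))
```

A gene would be flagged as enriched when this p-value falls below a threshold corrected for the number of genes tested. The smallest p-value the sketch can report is 1 / (n_sim + 1), so more simulations are needed to resolve stronger signals.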
