New genes and functional innovation in mammals

10.1101/090860

2016

Cited By ~ 1

Author(s): José Luis Villanueva-Cañas, Jorge Ruiz-Orera, M. Isabel Agea, Maria Gallo, David Andreu, ...

Keyword(s): De Novo, Gene Families, Specific Gene, Protein Coding, Evolutionary Innovation, New Genes, Recent Origin, Mammalian Genes, Genomic Regions, New Protein

Abstract The birth of genes that encode new protein sequences is a major source of evolutionary innovation. However, we still understand relatively little about how these genes come into being and which functions they are selected for. To address these questions we have obtained a large collection of mammalian-specific gene families that lack homologues in other eukaryotic groups. We have combined gene annotations and de novo transcript assemblies from 30 different mammalian species, obtaining about 6,000 gene families. In general, the proteins in mammalian-specific gene families tend to be short and depleted in aromatic and negatively charged residues. Proteins which arose early in mammalian evolution include milk and skin polypeptides, immune response components, and proteins involved in reproduction. In contrast, the functions of proteins which have a more recent origin remain largely unknown, despite the fact that these proteins also have extensive proteomics support. We identify several previously described cases of genes that originated de novo from non-coding genomic regions, supporting the idea that this mechanism frequently underlies the evolution of new protein-coding genes in mammals. Finally, we show that most young mammalian genes are preferentially expressed in testis, suggesting that sexual selection plays an important role in the emergence of new functional genes.

Faculty Opinions recommendation of New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature

10.3410/f.725762623.793527098

2017

Author(s): Erich Bornberg-Bauer

Keyword(s): De Novo, Protein Coding, Coding Sequence, Protein Coding Genes, Evolutionary Innovation, New Genes

Studying the dawn of de novo gene emergence in mice reveals fast integration of new genes into functional networks

10.1101/510214

2019

Cited By ~ 3

Author(s): Chen Xie, Cemalettin Bekpen, Sven Künzel, Maryam Keshavarz, Rebecca Krebs-Wheaton, ...

Keyword(s): De Novo, Expression Patterns, Transcriptional Networks, Protein Coding, Protein Coding Genes, New Genes, De Novo Gene, Intergenic Sequences, Genomic Analyses, New Protein

Abstract The de novo emergence of new transcripts has been well documented through genomic analyses. However, a functional analysis, especially of very young protein-coding genes, is still largely lacking. Here we focus on three loci that have evolved from previously intergenic sequences in the house mouse (Mus musculus) and are not present in its closest relatives. We have obtained knockouts and analyzed their phenotypes, including a deep transcriptomic analysis, based on a dedicated power analysis. We show that the transcriptional networks are significantly disturbed in the knockouts and that all three genes have effects on phenotypes that are related to their expression patterns. This includes behavioral effects, skeletal differences and the regulation of the reproduction cycle in females. Substitution analysis suggests that all three genes have directly obtained an activity, without new adaptive substitutions. Our findings support the hypothesis that de novo genes can quickly adopt functions without extensive adaptation. Impact statement: New protein-coding genes emerging out of non-coding sequences can become directly functional without signatures of adaptive protein changes.

New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation

Philosophical Transactions of the Royal Society B: Biological Sciences

10.1098/rstb.2014.0332

2015

Vol 370 (1678)

pp. 20140332

Cited By ~ 72

Author(s): Aoife McLysaght, Daniele Guerzoni

Keyword(s): De Novo, Complex Structure, Protein Coding, Functional Roles, Protein Coding Genes, Evolutionary Innovation, New Genes, De Novo Genes, Novel Protein

The origin of novel protein-coding genes de novo was once considered so improbable as to be impossible. In less than a decade, and especially in the last five years, this view has been overturned by extensive evidence from diverse eukaryotic lineages. There is now evidence that this mechanism has contributed a significant number of genes to genomes of organisms as diverse as Saccharomyces, Drosophila, Plasmodium, Arabidopsis and human. From simple beginnings, these genes have in some instances acquired complex structure, regulated expression and important functional roles. New genes are often thought of as dispensable late additions; however, some recent de novo genes in human can play a role in disease. Rather than an extremely rare occurrence, it is now evident that there is a relatively constant trickle of proto-genes released into the testing ground of natural selection. It is currently unknown whether de novo genes arise primarily through an ‘RNA-first’ or ‘ORF-first’ pathway. Either way, evolutionary tinkering with this pool of genetic potential may have been a significant player in the origins of lineage-specific traits and adaptations.

Understanding the Early Evolutionary Stages of a Tandem Drosophila melanogaster-Specific Gene Family: A Structural and Functional Population Study

Molecular Biology and Evolution

10.1093/molbev/msaa109

2020

Vol 37 (9)

pp. 2584-2600

Cited By ~ 3

Author(s): Bryan D Clifton, Jamie Jimenez, Ashlyn Kimura, Zeinab Chahine, Pablo Librado, ...

Keyword(s): Gene Family, Sequence Similarity, Gene Families, Read Depth, Specific Gene, Protein Variant, Protein Coding, Expression Levels, Number Variation, Reference Quality

Abstract Gene families underlie genetic innovation and phenotypic diversification. However, our understanding of the early genomic and functional evolution of tandemly arranged gene families remains incomplete as paralog sequence similarity hinders their accurate characterization. The Drosophila melanogaster-specific gene family Sdic is tandemly repeated and impacts sperm competition. We scrutinized Sdic in 20 geographically diverse populations using reference-quality genome assemblies, read-depth methodologies, and qPCR, finding that ∼90% of the individuals harbor 3–7 copies as well as evidence of population differentiation. In strains with reliable gene annotations, copy number variation (CNV) and differential transposable element insertions distinguish one structurally distinct version of the Sdic region per strain. All 31 annotated copies featured protein-coding potential and, based on the protein variant encoded, were categorized into 13 paratypes differing in their 3′ ends, with 3–5 paratypes coexisting in any strain examined. Despite widespread gene conversion, the only copy present in all strains has functionally diverged at both coding and regulatory levels under positive selection. Contrary to artificial tandem duplications of the Sdic region that resulted in increased male expression, CNV in cosmopolitan strains did not correlate with expression levels, likely as a result of differential genome modifier composition. Duplicating the region did not enhance sperm competitiveness, suggesting a fitness cost at high expression levels or a plateau effect. Beyond facilitating a minimally optimal expression level, Sdic CNV acts as a catalyst of protein and regulatory diversity, showcasing a possible evolutionary path recently formed tandem multigene families can follow toward long-term consolidation in eukaryotic genomes.

Rapidly evolving protointrons in Saccharomyces genomes revealed by a hungry spliceosome

10.1101/515197

2019

Cited By ~ 2

Author(s): Jason Talkish, Haller Igel, Rhonda J. Perriman, Lily Shiue, Sol Katzman, ...

Keyword(s): Gene Expression, Protein Isoforms, Yeast Genome, Protein Coding, Coding Sequences, New Genes, Eukaryotic Genes, Non Coding RNAs, The Creation, New Protein

Abstract Introns are a prevalent feature of eukaryotic genomes, yet their origins and contributions to genome function and evolution remain mysterious. In budding yeast, repression of the highly transcribed intron-containing ribosomal protein genes (RPGs) globally increases splicing of non-RPG transcripts through reduced competition for the spliceosome. We show that under these “hungry spliceosome” conditions, splicing occurs at more than 150 previously unannotated locations we call protointrons that do not overlap known introns. Protointrons use a less constrained set of splice sites and branchpoints than standard introns, including in one case AT-AC in place of GT-AG. Protointrons are not conserved in all closely related species, suggesting that most are not under selection. Some are found in non-coding RNAs (e.g., CUTs and SUTs), where they may contribute to the creation of new genes. Others are found across boundaries between noncoding and coding sequences, or within coding sequences, where they offer pathways to the creation of new protein variants, or new regulatory controls for existing genes. We define protointrons as (1) nonconserved intron-like sequences that are (2) infrequently spliced, and importantly (3) are not currently understood to contribute to gene expression or regulation in the way that standard introns function. A very few protointrons in S. cerevisiae challenge this classification by their increased splicing frequency and potential function, consistent with the proposed evolutionary process of “intronization”, whereby new standard introns are created.
This snapshot of intron evolution highlights the important role of the spliceosome in the expansion of transcribed genomic sequence space, providing a pathway for the rare events that may lead to the birth of new eukaryotic genes and the refinement of existing gene function.
Author Summary: The protein coding information in eukaryotic genes is broken by intervening sequences called introns that are removed from RNA during transcription by a large protein-RNA complex called the spliceosome. Where introns come from and how the spliceosome contributes to genome evolution are open questions. In this study, we find more than 150 new places in the yeast genome that are recognized by the spliceosome and spliced out as introns. Since they appear to have arisen very recently in evolution by sequence drift and do not appear to contribute to gene expression or its regulation, we call these protointrons. Protointrons are found in both protein-coding and non-coding RNAs and are not efficiently removed by the splicing machinery. Although most protointrons are not conserved, a few are spliced more efficiently, and are located where they might begin to play functional roles in gene expression, as predicted by the proposed process of intronization. The challenge now is to understand how spontaneously appearing splicing events like protointrons might contribute to the creation of new genes, new genetic controls, and new protein isoforms as genomes evolve.

A de novo evolved gene in the house mouse regulates female pregnancy cycles

eLife

10.7554/elife.44392

2019

Vol 8

Cited By ~ 4

Author(s): Chen Xie, Cemalettin Bekpen, Sven Künzel, Maryam Keshavarz, Rebecca Krebs-Wheaton, ...

Keyword(s): House Mouse, De Novo, Specific Protein, Ribosome Profiling, Mass Spectrometry Data, Preimplantation Embryos, Protein Coding, Reading Frame, Protein Coding Genes, New Genes

The de novo emergence of new genes has been well documented through genomic analyses. However, a functional analysis, especially of very young protein-coding genes, is still largely lacking. Here, we identify a set of house mouse-specific protein-coding genes and assess their translation by ribosome profiling and mass spectrometry data. We functionally analyze one of them, Gm13030, which is specifically expressed in females in the oviduct. The interruption of the reading frame affects the transcriptional network in the oviducts at a specific stage of the estrous cycle. This includes the upregulation of Dcpp genes, which are known to stimulate the growth of preimplantation embryos. As a consequence, knockout females have their second litters after shorter times and have a higher infanticide rate. Given that Gm13030 shows no signs of positive selection, our findings support the hypothesis that a de novo evolved gene can directly adopt a function without much sequence adaptation.

New Genes Born-In or Invading Vertebrate Genomes

Frontiers in Cell and Developmental Biology

10.3389/fcell.2021.713918

2021

Vol 9

Author(s): Carlos Herrera-Úbeda, Jordi Garcia-Fernàndez

Keyword(s): Immune System, Horizontal Gene Transfer, Gene Networks, De Novo, Fundamental Question, High Tolerance, Virus Infections, Vertebrate Lineage, Protein Coding, New Genes

The origin of genes is a fundamental question in biology, indeed a question older than the discovery of genes themselves. For more than a century, it was unusual to consider origins other than duplication and divergence from a previous gene. In recent years, however, the intersection of genetics, embryonic development, and bioinformatics has brought to light that de novo generation from non-genic DNA, horizontal gene transfer and, notably, virus and transposon invasions have shaped current genomes by integrating those newcomers into old gene networks, helping to shape morphological and physiological innovations. Here we summarize some of the recent research in the field, mostly in the vertebrate lineage and with a focus on protein-coding novelties, showing that the placenta, the adaptive immune system, and the highly developed neocortex, among other innovations, are linked to de novo gene creation or the domestication of viruses and transposons. We provocatively suggest that the high tolerance of bats to virus infections may also be related to previous virus and transposon invasions in the bat lineage.

Reference Genome for the Highly Transformable Setaria viridis ME034V

G3: Genes|Genomes|Genetics

10.1534/g3.120.401345

2020

Vol 10 (10)

pp. 3467-3478

Cited By ~ 2

Author(s): Peter M. Thielen, Amanda L. Pendleton, Robert A. Player, Kenneth V. Bowden, Thomas J. Lawton, ...

Keyword(s): De Novo, Gene Families, Model Organisms, Phylogenomic Analysis, Setaria Viridis, Sequencing Technology, Protein Coding, Genotype Frequencies, Green Foxtail, Genome Assemblies

Setaria viridis (green foxtail) is an important model system for improving cereal crops due to its diploid genome, ease of cultivation, and use of C4 photosynthesis. The S. viridis accession ME034V is exceptionally transformable, but the lack of a sequenced genome for this accession has limited its utility. We present a 397 Mb highly contiguous de novo assembly of ME034V using ultra-long nanopore sequencing technology (read N50 = 41 kb). We estimate that this genome is largely complete based on our updated k-mer-based genome size estimate of 401 Mb for S. viridis. Genome annotation identified 37,908 protein-coding genes and >300k repetitive elements comprising 46% of the genome. We compared the ME034V assembly with two other previously sequenced Setaria genomes as well as with a diversity panel of 235 S. viridis accessions. We found the genome assemblies to be largely syntenic, but numerous unique polymorphic structural variants were discovered. Several ME034V deletions may be associated with recent retrotransposition of copia and gypsy LTR repeat families, as evidenced by their low genotype frequencies in the sampled population. Lastly, we performed a phylogenomic analysis to identify gene families that have expanded in Setaria, including those involved in specialized metabolism and plant defense response. The high continuity of the ME034V genome assembly validates the utility of ultra-long DNA sequencing to improve genetic resources for emerging model organisms. Structural variation present in Setaria illustrates the importance of obtaining the proper genome reference for genetic experiments. Thus, we anticipate that the ME034V genome will be of significant utility for the Setaria research community.

De novo emergence of adaptive membrane proteins from thymine-rich genomic sequences

Nature Communications

10.1038/s41467-020-14500-z

2020

Vol 11 (1)

Cited By ~ 7

Author(s): Nikolaos Vakirlis, Omer Acar, Brian Hsu, Nelson Castilho Coelho, S. Branden Van Oss, ...

Keyword(s): Natural Selection, De Novo, Natural Populations, Evolutionary Model, Transmembrane Domains, Protein Coding, Fitness Effects, Evolutionary Innovation, Intergenic Regions, Novel Protein

Abstract Recent evidence demonstrates that novel protein-coding genes can arise de novo from non-genic loci. This evolutionary innovation is thought to be facilitated by the pervasive translation of non-genic transcripts, which exposes a reservoir of variable polypeptides to natural selection. Here, we systematically characterize how these de novo emerging coding sequences impact fitness in budding yeast. Disruption of emerging sequences is generally inconsequential for fitness in the laboratory and in natural populations. Overexpression of emerging sequences, however, is enriched in adaptive fitness effects compared to overexpression of established genes. We find that adaptive emerging sequences tend to encode putative transmembrane domains, and that thymine-rich intergenic regions harbor a widespread potential to produce transmembrane domains. These findings, together with in-depth examination of the de novo emerging YBR196C-A locus, suggest a novel evolutionary model whereby adaptive transmembrane polypeptides emerge de novo from thymine-rich non-genic regions and subsequently accumulate changes molded by natural selection.

A high-quality chromosome-level genome assembly reveals genetics for important traits in eggplant

Horticulture Research

10.1038/s41438-020-00391-0

2020

Vol 7 (1)

Author(s): Qingzhen Wei, Jinglei Wang, Wuhong Wang, Tianhua Hu, Haijiao Hu, ...

Keyword(s): Genome Assembly, Reference Genome, Repetitive Sequences, Gene Families, Specific Gene, High Quality, Total Size, Protein Coding, Fruit Length, Protein Coding Genes

Abstract Eggplant (Solanum melongena L.) is an economically important vegetable crop in the Solanaceae family, with extensive diversity among landraces and close relatives. Here, we report a high-quality reference genome for the eggplant inbred line HQ-1315 (S. melongena-HQ) using a combination of Illumina, Nanopore and 10x Genomics sequencing technologies and Hi-C technology for genome assembly. The assembled genome has a total size of ~1.17 Gb and 12 chromosomes, with a contig N50 of 5.26 Mb, consisting of 36,582 protein-coding genes. Repetitive sequences comprise 70.09% (811.14 Mb) of the eggplant genome, most of which are long terminal repeat (LTR) retrotransposons (65.80%), followed by long interspersed nuclear elements (LINEs, 1.54%) and DNA transposons (0.85%). The S. melongena-HQ eggplant genome carries a total of 563 accession-specific gene families containing 1009 genes. In total, 73 expanded gene families (892 genes) and 34 contracted gene families (114 genes) were functionally annotated. Comparative analysis of different eggplant genomes identified three types of variations, including single-nucleotide polymorphisms (SNPs), insertions/deletions (indels) and structural variants (SVs). Asymmetric SV accumulation was found in potential regulatory regions of protein-coding genes among the different eggplant genomes. Furthermore, we performed QTL-seq for eggplant fruit length using the S. melongena-HQ reference genome and detected a QTL interval of 71.29–78.26 Mb on chromosome E03. The gene Smechr0301963, which belongs to the SUN gene family, is predicted to be a key candidate gene for eggplant fruit length regulation. Moreover, we anchored a total of 210 linkage markers associated with 71 traits to the eggplant chromosomes and finally obtained 26 QTL hotspots. The eggplant HQ-1315 genome assembly can be accessed at http://eggplant-hq.cn.
In conclusion, the eggplant genome presented herein provides a global view of genomic divergence at the whole-genome level and powerful tools for the identification of candidate genes for important traits in eggplant.