A Continuum of Evolving De Novo Genes Drives Protein-Coding Novelty in Drosophila

Comparative genomics has enabled the identification of genes that potentially evolved de novo from non-coding sequences. Many such genes are expressed in male reproductive tissues, but their functions remain poorly understood. To address this, we conducted a functional genetic screen of over 40 putative de novo genes with testis-enriched expression in Drosophila melanogaster and identified one gene, atlas, required for male fertility. Detailed genetic and cytological analyses show that atlas is required for proper chromatin condensation during the final stages of spermatogenesis. Atlas protein is expressed in spermatid nuclei and facilitates the transition from histone- to protamine-based chromatin packaging. Complementary evolutionary analyses revealed the complex evolutionary history of atlas. The protein-coding portion of the gene likely arose at the base of the Drosophila genus on the X chromosome but was unlikely to be essential, as it was then lost in several independent lineages. Within the last ~15 million years, however, the gene moved to an autosome, where it fused with a conserved non-coding RNA and evolved a non-redundant role in male fertility. Altogether, this study provides insight into the integration of novel genes into biological processes, the links between genomic innovation and functional evolution, and the genetic control of a fundamental developmental process, gametogenesis.


From de novo to ‘de nono’: most novel protein coding genes identified with phylostratigraphy represent old genes or recent duplicates

10.1101/287193 ◽

2018 ◽

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Gc Content ◽

Protein Coding ◽

Protein Coding Genes ◽

Gene Sets ◽

De Novo Genes ◽

De Novo Gene ◽

Similarity Searches ◽

Novel Protein

Abstract The evolution of novel protein-coding genes from noncoding regions of the genome is among the most compelling evidence for genetic innovation in nature. One popular approach to identify de novo genes is phylostratigraphy, which consists of determining the approximate time of origin (age) of a gene based on its distribution along a species phylogeny. Several studies have revealed significant flaws in determining the age of genes, including de novo genes, using phylostratigraphy alone. However, the rate of false positives in de novo gene surveys based on phylostratigraphy remains unknown. Here, I re-analyze the findings from three studies, two of which identified tens to hundreds of rodent-specific de novo genes adopting a phylostratigraphy-centered approach. Most of the putative de novo genes discovered in these investigations are no longer included in recently updated mouse gene sets. Using a combination of synteny information and sequence similarity searches, I show that about 60% of the remaining 381 putative de novo genes share homology with genes from other vertebrates, originated through gene duplication, and/or share no synteny information with non-rodent mammals. These results led to an estimated rate of ∼12 de novo genes per million years in mouse. Contrary to a previous study (Wilson et al. 2017), I found no evidence supporting the preadaptation hypothesis of de novo gene formation. Nearly half of the de novo genes confirmed in this study are within older genes, indicating that co-option of preexisting regulatory regions and a higher GC content may facilitate the origin of novel genes.
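As a rough illustration of the phylostratigraphic idea described in this abstract, the sketch below assigns a gene to the youngest clade (stratum) that contains every species in which a homolog was detected. It is a toy example and not the pipeline used in the study: the strata, the species sets, and the function name `assign_stratum` are all hypothetical.

```python
# Toy phylostratigraphy sketch. Everything here is hypothetical illustration:
# the strata, the species sets, and the function name are assumptions.

# Nested clades on the lineage of the focal species (mouse), youngest first.
STRATA = [
    ("Mus", {"mouse"}),
    ("Rodentia", {"mouse", "rat"}),
    ("Mammalia", {"mouse", "rat", "human"}),
    ("Vertebrata", {"mouse", "rat", "human", "zebrafish"}),
]


def assign_stratum(species_with_homolog):
    """Return the youngest stratum containing all species with a detected homolog."""
    hits = set(species_with_homolog) | {"mouse"}  # the focal species always counts
    for name, members in STRATA:
        if hits <= members:
            return name
    raise ValueError("homolog found outside the sampled phylogeny")


# A gene with hits only in rat is dated to the rodent stratum.
print(assign_stratum({"rat"}))
# If a true zebrafish homolog is missed by the similarity search, the same gene
# looks rodent-specific; once the hit is found, it is dated much older.
print(assign_stratum({"rat", "zebrafish"}))
```

The second call shows why the age estimate depends entirely on which homologs the similarity search detects: a missed distant homolog makes an old gene appear young, which is one source of the false positives discussed above.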


Only a Single Taxonomically Restricted Gene Family in the Drosophila melanogaster Subgroup Can Be Identified with High Confidence

Genome Biology and Evolution ◽

10.1093/gbe/evaa127 ◽

2020 ◽

Vol 12 (8) ◽

pp. 1355-1366

Author(s):

Karina Zile ◽

Christophe Dessimoz ◽

Yannick Wurm ◽

Joanna Masel

Keyword(s):

Drosophila Melanogaster ◽

De Novo ◽

Experimental Studies ◽

High Confidence ◽

Protein Coding ◽

Noncoding Sequences ◽

De Novo Genes ◽

Intergenic Sequences ◽

Drosophila Melanogaster Subgroup ◽

Reading Frames

Abstract Taxonomically restricted genes (TRGs) are genes that are present only in one clade. Protein-coding TRGs may evolve de novo from previously noncoding sequences: functional ncRNA, introns, or alternative reading frames of older protein-coding genes, or intergenic sequences. A major challenge in studying de novo genes is the need to avoid both false-positives (nonfunctional open reading frames and/or functional genes that did not arise de novo) and false-negatives. Here, we search conservatively for high-confidence TRGs as the most promising candidates for experimental studies, ensuring functionality through conservation across at least two species, and ensuring de novo status through examination of homologous noncoding sequences. Our pipeline also avoids ascertainment biases associated with preconceptions of how de novo genes are born. We identify one TRG family that evolved de novo in the Drosophila melanogaster subgroup. This TRG family contains single-copy genes in Drosophila simulans and Drosophila sechellia. It originated in an intron of a well-established gene, sharing that intron with another well-established gene upstream. These TRGs contain an intron that predates their open reading frame. These genes have not been previously reported as de novo originated, and to our knowledge, they are the best Drosophila candidates identified so far for experimental studies aimed at elucidating the properties of de novo genes.


New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2014.0332 ◽

2015 ◽

Vol 370 (1678) ◽

pp. 20140332 ◽

Cited By ~ 72

Author(s):

Aoife McLysaght ◽

Daniele Guerzoni

Keyword(s):

De Novo ◽

Complex Structure ◽

Protein Coding ◽

Functional Roles ◽

Protein Coding Genes ◽

Evolutionary Innovation ◽

New Genes ◽

De Novo Genes ◽

Novel Protein

The origin of novel protein-coding genes de novo was once considered so improbable as to be impossible. In less than a decade, and especially in the last five years, this view has been overturned by extensive evidence from diverse eukaryotic lineages. There is now evidence that this mechanism has contributed a significant number of genes to genomes of organisms as diverse as Saccharomyces, Drosophila, Plasmodium, Arabidopsis and human. From simple beginnings, these genes have in some instances acquired complex structure, regulated expression and important functional roles. New genes are often thought of as dispensable late additions; however, some recent de novo genes in human can play a role in disease. Rather than an extremely rare occurrence, it is now evident that there is a relatively constant trickle of proto-genes released into the testing ground of natural selection. It is currently unknown whether de novo genes arise primarily through an ‘RNA-first’ or ‘ORF-first’ pathway. Either way, evolutionary tinkering with this pool of genetic potential may have been a significant player in the origins of lineage-specific traits and adaptations.


Insights into the Genome Sequence of Chromobacterium amazonense Isolated from a Tropical Freshwater Lake

International Journal of Genomics ◽

10.1155/2018/1062716 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10

Author(s):

Alexandre Bueno Santos ◽

Patrícia Silva Costa ◽

Anderson Oliveira do Carmo ◽

Gabriel da Rocha Fernandes ◽

Larissa Lopes Silva Scholte ◽

...

Keyword(s):

De Novo ◽

Genomic Diversity ◽

Protein Coding ◽

Biotechnological Potential ◽

Draft Assembly ◽

Functional Studies ◽

Alpha Hemolysin ◽

Type Iv ◽

Nudix Hydrolases ◽

Colicin V

Members of the genus Chromobacterium have been isolated from geographically diverse ecosystems and exhibit considerable metabolic flexibility, as well as biotechnological and pathogenic properties in some species. This study reports the draft assembly and detailed sequence analysis of Chromobacterium amazonense strain 56AF. The de novo-assembled genome is 4,556,707 bp in size and contains 4294 protein-coding and 95 RNA genes, including 88 tRNA, six rRNA, and one tmRNA operon. A repertoire of genes implicated in virulence, for example, hemolysin, hemolytic enterotoxins, colicin V, lytic proteins, and Nudix hydrolases, is present. The genome also contains a collection of genes of biotechnological interest, including esterases, lipase, auxins, chitinases, phytoene synthase and phytoene desaturase, polyhydroxyalkanoates, violacein, plastocyanin/azurin, and detoxifying compounds. Importantly, unlike other Chromobacterium species, the 56AF genome contains genes for pore-forming toxin alpha-hemolysin, a type IV secretion system, among others. The analysis of the C. amazonense strain 56AF genome reveals the versatility, adaptability, and biotechnological potential of this bacterium. This study provides molecular information that may pave the way for further comparative genomics and functional studies involving Chromobacterium-related isolates and improves our understanding of the global genomic diversity of Chromobacterium species.


CALINCA—A Novel Pipeline for the Identification of lncRNAs in Podocyte Disease

Cells ◽

10.3390/cells10030692 ◽

2021 ◽

Vol 10 (3) ◽

pp. 692

Author(s):

Sweta Talyan ◽

Samantha Filipów ◽

Michael Ignarski ◽

Magdalena Smieszek ◽

He Chen ◽

...

Keyword(s):

Cell Biology ◽

Mammalian Cells ◽

De Novo ◽

Depth Information ◽

Gene Products ◽

Classical Analysis ◽

Protein Coding ◽

Bioinformatic Pipeline ◽

Non Coding Rnas ◽

Filtration Unit

Diseases of the renal filtration unit (the glomerulus) are the most common cause of chronic kidney disease. Podocytes are the pivotal cell type for the function of this filter, and focal-segmental glomerulosclerosis (FSGS) is a classic example of a podocytopathy leading to proteinuria and glomerular scarring. Currently, no targeted treatment of FSGS is available. This lack of therapeutic strategies is explained by a limited understanding of the defects in podocyte cell biology leading to FSGS. To date, most studies in the field have focused on protein-coding genes and their gene products. However, more than 80% of all transcripts produced by mammalian cells are actually non-coding. Long non-coding RNAs (lncRNAs) are a relatively novel class of transcripts and have not been systematically studied in FSGS to date. Appropriate tools to facilitate lncRNA research for the renal scientific community are urgently required, because lncRNA analysis poses a range of challenges that classical pipelines optimized for coding RNA expression analysis do not address. Here, we present the bioinformatic pipeline CALINCA as a solution for this problem. CALINCA automatically analyzes datasets from murine FSGS models and quantifies both annotated and de novo assembled lncRNAs. In addition, the tool provides in-depth information on podocyte specificity of these lncRNAs, as well as evolutionary conservation and expression in human datasets, making this pipeline a crucial basis for lncRNA studies in FSGS.


From de novo to ‘de nono’: The majority of novel protein coding genes identified with phylostratigraphy are old genes or recent duplicates

Genome Biology and Evolution ◽

10.1093/gbe/evy231 ◽

2018 ◽

Cited By ~ 2

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Protein Coding ◽

Protein Coding Genes ◽

Novel Protein


EnTAP: Bringing Faster and Smarter Functional Annotation to Non-Model Eukaryotic Transcriptomes

10.1101/307868 ◽

2018 ◽

Cited By ~ 5

Author(s):

Alexander J. Hart ◽

Samuel Ginzburg ◽

Muyang (Sam) Xu ◽

Cera R. Fisher ◽

Nasim Rahmatpour ◽

...

Keyword(s):

Similarity Search ◽

De Novo ◽

Gene Annotation ◽

Enrichment Analysis ◽

Orthologous Gene ◽

Protein Domain ◽

Family Assessment ◽

Ontology Term ◽

Protein Coding ◽

Functional Gene Annotation

ABSTRACT EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install and configure, and runs significantly faster than comparable annotation packages. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.


Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome

10.1101/847558 ◽

2019 ◽

Author(s):

Ryan Bracewell ◽

Anita Tran ◽

Kamalakar Chatla ◽

Doris Bachtrog

Keyword(s):

X Chromosome ◽

Genome Assembly ◽

De Novo ◽

Pericentromeric Region ◽

Species Group ◽

Chromosome 15 ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Read ◽

Chromosome Level

ABSTRACT The Drosophila obscura species group is one of the most studied clades of Drosophila and harbors multiple distinct karyotypes. Here we present a de novo genome assembly and annotation of D. bifasciata, a species which represents an important subgroup for which no high-quality chromosome-level genome assembly currently exists. We combined long-read sequencing (Nanopore) and Hi-C scaffolding to achieve a highly contiguous genome assembly approximately 193 Mb in size, with repetitive elements constituting 30.1% of the total length. Drosophila bifasciata harbors four large metacentric chromosomes and the small dot, and our assembly contains each chromosome in a single scaffold, including the highly repetitive pericentromeres, which were largely composed of Jockey and Gypsy transposable elements. We annotated a total of 12,821 protein-coding genes, and comparisons of synteny with D. athabasca orthologs show that the large metacentric pericentromeric regions of multiple chromosomes are conserved between these species. Importantly, Muller A (X chromosome) was found to be metacentric in D. bifasciata, and the pericentromeric region appears homologous to the pericentromeric region of the fused Muller A-AD (XL and XR) of pseudoobscura/affinis subgroup species. Our finding suggests that a metacentric ancestral X fused to a telocentric Muller D, creating the large neo-X (Muller A-AD) chromosome ∼15 MYA. We also confirm the fusion of Muller C and D in D. bifasciata and show that it likely involved a centromere-centromere fusion.


Genomic Analysis of Sarcomyxa edulis Reveals the Basis of Its Medicinal Properties and Evolutionary Relationships

Frontiers in Microbiology ◽

10.3389/fmicb.2021.652324 ◽

2021 ◽

Vol 12 ◽

Author(s):

Fenghua Tian ◽

Changtian Li ◽

Yu Li

Keyword(s):

Single Molecule ◽

De Novo ◽

Genomic Analysis ◽

Single Copy ◽

Whole Genome Sequence ◽

Type I ◽

Whole Genome ◽

Uridine Diphosphate ◽

Protein Coding ◽

Medicinal Value

Yuanmo [Sarcomyxa edulis (Y.C. Dai, Niemelä & G.F. Qin) T. Saito, Tonouchi & T. Harada] is an important edible and medicinal mushroom endemic to Northeastern China. Here we report the de novo sequencing and assembly of the S. edulis genome using single-molecule real-time sequencing technology. The whole genome was approximately 35.65 Mb, with a G + C content of 48.31%. Genome assembly generated 41 contigs with an N50 length of 1,772,559 bp. The genome comprised 9,364 annotated protein-coding genes, many of which encoded enzymes involved in the modification, biosynthesis, and degradation of glycoconjugates and carbohydrates or enzymes predicted to be involved in the biosynthesis of secondary metabolites such as terpene, type I polyketide, siderophore, and fatty acids, which are responsible for the pharmacodynamic activities of S. edulis. We also identified genes encoding 1,3-β-glucan synthase and endo-1,3(4)-β-glucanase, which are involved in polysaccharide and uridine diphosphate glucose biosynthesis. Phylogenetic and comparative analyses of Basidiomycota fungi based on a single-copy orthologous protein indicated that the Sarcomyxa genus is an independent group that evolved from the Pleurotaceae family. The annotated whole-genome sequence of S. edulis can serve as a reference for investigations of bioactive compounds with medicinal value and the development and commercial production of superior S. edulis varieties.
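For readers unfamiliar with the N50 statistic quoted in this abstract, the following is a minimal sketch of how it is conventionally computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` and the contig lengths are illustrative assumptions, not data from the S. edulis assembly.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths in bp (total 20): 6 + 5 = 11 reaches half of 20.
print(n50([2, 3, 4, 5, 6]))  # prints 5
```

A higher N50 for a given assembly size means more of the genome sits in long contigs, which is why it is commonly reported alongside the contig count.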
