Comparable Number of Genes Having Experienced Positive Selection among Great Ape Species

Alleles that confer advantageous phenotypes under positive selection contribute to adaptive evolution. Investigations of positive selection in protein-coding genes depend on the accuracy of orthology assignment, the evolutionary models, the quality of genome assemblies, and sequence alignment. Here, based on the latest genome assemblies and gene annotations, we present a comparative analysis of positive selection in four great ape species and identify 211 high-confidence positively selected genes (PSGs). Although differences in population size among these closely related great apes have resulted in differences in their ability to remove deleterious alleles and to adapt to changing environments, we found comparable numbers of genes that experienced positive selection among them. We also found that more than half of multigene families exhibited signals of positive selection, suggesting that imbalanced positive selection resulted in the functional divergence of duplicates. Moreover, at the expression level, although positive selection led to a more non-uniform pattern across tissues, the correlation between positive selection and expression patterns varied. Overall, this updated list of PSGs is of great significance for the further study of phenotypic evolution in great apes.

Liftoff: accurate mapping of gene annotations

Bioinformatics ◽

10.1093/bioinformatics/btaa1016 ◽

2020 ◽

Author(s):

Alaina Shumate ◽

Steven L Salzberg

Keyword(s):

Reference Genome ◽

Supplementary Information ◽

Closely Related Species ◽

Protein Coding ◽

Human Reference Genome ◽

Sequence Identity ◽

Gene Annotations ◽

Genome Assemblies ◽

Average Sequence Identity ◽

High Quality Genome

Abstract Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. Results: One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. Availability and Implementation: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. Supplementary information: Supplementary data are available at Bioinformatics online.

Evolutionary Rate Heterogeneity and Functional Divergence of Orthologous Genes in Pyrus

Biomolecules ◽

10.3390/biom9090490 ◽

2019 ◽

Vol 9 (9) ◽

pp. 490 ◽

Cited By ~ 2

Author(s):

Yunpeng Cao ◽

Lan Jiang ◽

Lihu Wang ◽

Yongping Cai

Keyword(s):

Expression Profiles ◽

Functional Divergence ◽

Expression Patterns ◽

Synonymous Substitution ◽

Evolutionary Rates ◽

Orthologous Genes ◽

Protein Coding ◽

Different Types ◽

Synonymous Substitution Rates ◽

Gene Structures

Most nuclear protein-coding genes in organisms fall into two types: negatively selected genes (NSGs) and positively selected genes (PSGs). However, the evolutionary rates and characteristics of these gene types remain poorly understood. In the present study, we investigate the rates of synonymous substitution (Ks) and non-synonymous substitution (Ka) by comparing the orthologous genes of two sequenced Pyrus species, Pyrus bretschneideri and Pyrus communis. Subsequently, we compared the evolutionary rates, gene structures, and expression profiles across different stages of fruit development between PSGs and NSGs. Compared with the NSGs, the PSGs have fewer exons, shorter gene lengths, lower synonymous substitution rates, and higher evolutionary rates. Remarkably, gene expression patterns in the fruit of the two Pyrus species indicated functional divergence for most of the orthologous genes derived from a common ancestor, and subfunctionalization for some of them. Overall, the present study shows that PSGs differ from NSGs not only in environmental selective pressure (Ka/Ks), but also in their structural, functional, and evolutionary properties. Additionally, our data provide important insights into the evolution, and highlight the diversification, of orthologous genes in two Pyrus species.
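As a rough illustration of how a Ka/Ks ratio can separate the two gene classes, the sketch below labels an orthologous pair as a PSG when Ka/Ks > 1 and as an NSG when Ka/Ks < 1. This is the conventional interpretation of the ratio; the abstract does not state the exact cutoff the authors used, so the threshold, the `OrthologPair` type, and the gene identifiers are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class OrthologPair:
    """One orthologous gene pair (hypothetical container, not from the paper)."""
    gene_id: str
    ka: float  # non-synonymous substitution rate
    ks: float  # synonymous substitution rate


def classify(pair: OrthologPair, threshold: float = 1.0) -> str:
    """Label a pair from its Ka/Ks ratio.

    The threshold of 1.0 is the conventional neutral expectation and is an
    assumption here, not a value taken from the study.
    """
    if pair.ks <= 0:
        # The ratio is undefined when no synonymous change is observed.
        return "undetermined"
    ratio = pair.ka / pair.ks
    if ratio > threshold:
        return "PSG"
    if ratio < threshold:
        return "NSG"
    return "neutral"


# Hypothetical example pairs.
pairs = [
    OrthologPair("gene_A", ka=0.30, ks=0.10),
    OrthologPair("gene_B", ka=0.02, ks=0.40),
    OrthologPair("gene_C", ka=0.05, ks=0.0),
]
labels = {p.gene_id: classify(p) for p in pairs}
```

In practice Ka and Ks would come from a dedicated estimator run on codon alignments; the sketch covers only the final labeling step.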

Differential Expression and Function of the CD33-Related Siglecs between Humans and Great Apes.

Blood ◽

10.1182/blood.v104.11.1466.1466 ◽

2004 ◽

Vol 104 (11) ◽

pp. 1466-1466

Author(s):

Nancy Hurtado-Ziola ◽

Justin L. Sonnenburg ◽

Ajit Varki

Keyword(s):

Sialic Acid ◽

Blood Cells ◽

Expression Patterns ◽

Great Apes ◽

Sialic Acids ◽

Innate Immune ◽

Immunoglobulin Superfamily ◽

Last Common Ancestor ◽

Great Ape ◽

Sialic Acid Binding

Abstract The Siglecs (Sialic acid-binding Immunoglobulin Superfamily Lectins) are a recently discovered family of mammalian glycan-binding proteins that have been shown to recognize the terminal sialic acids of glycoproteins and glycolipids. The CD33-Related Siglecs (CD33rSiglecs, namely Siglec-3, -5 through -11 and -XII in humans) are a subgroup of these molecules, which are thought to be primarily expressed on cells of the innate immune system. All CD33rSiglecs are type-1 transmembrane proteins with an N-terminal sialic acid-recognizing V-set domain followed by a variable number of C-2 set domains, a transmembrane region and a cytosolic C-terminal domain that usually has two tyrosine-based signaling motifs, one of which conforms to a canonical negative regulatory ITIM motif. Although the true function of the CD33rSiglecs has yet to be discovered, available data are most consistent with an inhibitory signaling role in the innate immune response, mediated by recognition of host sialic acids as “self”. CD33rSiglecs also interact with sialic acids on the same cell surface, typically resulting in “masking” of their sialic acid-binding sites. Our recent studies have shown that humans and non-human primates have a similar clustered localization of CD33rSiglec genes, and that true orthologs can generally be identified within each cluster (Angata et al., PNAS, in press). However, humans no longer express CMP-sialic acid hydroxylase (CMAH), the enzyme required to generate one of the potential CD33rSiglec sialic acid ligands, called N-glycolylneuraminic acid (Neu5Gc), from its precursor N-acetylneuraminic acid (Neu5Ac). This genetic change occurred after our last common ancestor with the great apes, and dramatically altered the “Sialome” (the sialic acid makeup of a specific species) of humans when compared to that of the great apes. While great ape blood cells express about equal amounts of Neu5Ac and Neu5Gc, human blood cells express almost exclusively Neu5Ac.
We also recently discovered that preferential recognition of Neu5Gc is the ancestral condition of most or all of the great ape (chimpanzee and gorilla) CD33rSiglecs (Sonnenburg JL, Altheide TK, Varki A. Glycobiology 14:339–46, 2004). We therefore reasoned that the sudden and major change in the sialome of our hominid ancestors could have had a significant impact on the evolution, binding specificities and expression patterns of CD33rSiglecs. Indeed, we have found that all human CD33rSiglecs can recognize both Neu5Ac and Neu5Gc. This presumably represents an evolutionarily-selected “relaxation” in binding specificity that was necessary to “remask” the Siglecs that had lost their Neu5Gc ligands. Also, there are differences in CD33rSiglec expression on monocytes and neutrophils between humans and great apes (chimp, bonobo, gorilla and orangutan). Furthermore, while great ape cells often show multiple populations with different signal intensities, human cells show a single bright peak for each Siglec in flow cytometry. Surprisingly, while humans showed almost no CD33rSiglec expression on lymphocytes, the great apes showed moderate to high expression of some Siglecs on these cells. Total leukocyte expression of some CD33rSiglecs also shows differences between humans and great apes. Overall, CD33rSiglecs appear to be rapidly evolving in primates, with an apparent further acceleration of changes in humans. Additional studies are needed to define the mechanistic details, as well as the implications for human health and disease.

High Satellite Repeat Turnover in Great Apes Studied with Short- and Long-Read Technologies

Molecular Biology and Evolution ◽

10.1093/molbev/msz156 ◽

2019 ◽

Vol 36 (11) ◽

pp. 2415-2431 ◽

Cited By ~ 10

Author(s):

Monika Cechova ◽

Robert S Harris ◽

Marta Tomaszkiewicz ◽

Barbara Arbeithuber ◽

Francesca Chiaromonte ◽

...

Keyword(s):

Great Apes ◽

Great Ape ◽

Satellite Repeat ◽

Satellite Sequences ◽

Satellite Repeats ◽

Repetitive Nature ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Satellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.
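To make the notion of repeat density in unassembled reads concrete, here is a minimal sketch that counts bases covered by perfect tandem runs of a motif such as AATGG and normalizes by the total number of sequenced bases. It is not the authors' tool: the function name, the minimum of two copies, and the per-kilobase unit are assumptions, and the sketch ignores motif rotations, reverse complements, and imperfect copies, all of which a real analysis would need to handle.

```python
import re
from typing import Iterable


def repeat_density(reads: Iterable[str], motif: str, min_copies: int = 2) -> float:
    """Bases inside perfect tandem runs of `motif` per 1,000 sequenced bases.

    A run must contain at least `min_copies` consecutive copies to count.
    Both the cutoff and the per-kb unit are illustrative assumptions.
    """
    # e.g. motif="AATGG", min_copies=2 gives the pattern (?:AATGG){2,}
    pattern = re.compile(f"(?:{re.escape(motif.upper())}){{{min_copies},}}")
    total_bases = 0
    repeat_bases = 0
    for read in reads:
        seq = read.upper()
        total_bases += len(seq)
        for match in pattern.finditer(seq):
            repeat_bases += match.end() - match.start()
    if total_bases == 0:
        return 0.0
    return 1000.0 * repeat_bases / total_bases


# Hypothetical toy reads: the first holds three tandem copies of AATGG.
toy_reads = ["AATGGAATGGAATGGTTTTT", "ACGTACGTAC"]
density = repeat_density(toy_reads, "AATGG")
```

Computing such densities per motif and per individual yields the kind of feature vector that could then be clustered to compare individuals and species.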

Lost in Translation: The Pitfalls of Ensembl Gene Annotations Between Human Genome Assemblies and Their Impact on Diagnostics

10.21203/rs.3.rs-131927/v1 ◽

2020 ◽

Author(s):

Mohammed O.E Abdallah ◽

Mahmoud Koko ◽

Raj Ramesar

Keyword(s):

Human Genome ◽

Genome Assembly ◽

Evolutionary Constraint ◽

Clinical Genetics ◽

Ensembl Gene ◽

Protein Coding ◽

Gene Annotations ◽

Human Genome Assembly ◽

Gene Models ◽

Genome Assemblies

Abstract Background: The GRCh37 human genome assembly is still widely used in genomics despite the fact that an updated human genome assembly (GRCh38) has been available for many years. A particular issue with relevant ramifications for clinical genetics is the GRCh37 Ensembl gene annotations, which have been archived, and thus not updated, since 2013. These Ensembl GRCh37 gene annotations are just as ubiquitous as the former assembly and are the default gene models used and preferred by the majority of genomic projects internationally. In this study, we highlight the issue of genes with discrepant annotations that have been recognized as protein coding in the new but not the old assembly. These genes are ignored by all genomic resources that still rely on the archived and outdated gene annotations. Moreover, the majority, if not all, of these discrepant genes (DGs) are automatically discarded and ignored by all variant prioritization tools that rely on the GRCh37 Ensembl gene annotations. Methods: We performed bioinformatics analysis identifying Ensembl genes with discrepant annotations between the two most recent human genome assemblies, hg37 and hg38. Clinical and phenotype gene curations were obtained and compared for this gene set. Furthermore, matching RefSeq transcripts were also collated and analyzed. Results: We found hundreds of genes (N=267) that were reclassified as “protein-coding” in the new hg38 assembly. Notably, 169 of these genes also had a discrepant HGNC gene symbol between the two assemblies. Most genes had RefSeq matches (N=199/267), including all the genes with defined phenotypes in the Ensembl GRCh38 gene set (N=10). However, many protein-coding genes remain missing from the current known RefSeq gene models (N=68). Conclusion: We found many clinically relevant genes in this group of neglected genes and we anticipate that many more will be found relevant in the future. For these genes, the inaccurate label of “non-protein-coding” hinders the possibility of identifying any causal sequence variants that overlap them. In addition, important additional annotations such as evolutionary constraint metrics are also not calculated for these genes for the same reason, further relegating them into oblivion.
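The comparison described in the Methods can be pictured as a lookup across two gene-to-biotype tables. The sketch below is a simplified illustration, not the authors' pipeline: it assumes each annotation release has already been reduced to a dictionary mapping a stable gene identifier to a biotype string, and it reports genes labeled protein coding in the newer table but not in the older one. The identifiers are hypothetical, and restricting the result to genes present in both tables is an assumption.

```python
from typing import Dict, List


def discrepant_genes(
    old_biotypes: Dict[str, str],
    new_biotypes: Dict[str, str],
    target: str = "protein_coding",
) -> List[str]:
    """Genes labeled `target` in the new annotation but not in the old one.

    Only genes present in both tables are considered (an assumption);
    genes that are entirely new in the later release are skipped.
    """
    return sorted(
        gene_id
        for gene_id, biotype in new_biotypes.items()
        if biotype == target
        and gene_id in old_biotypes
        and old_biotypes[gene_id] != target
    )


# Hypothetical identifiers and biotypes, for illustration only.
grch37 = {"GENE_A": "lincRNA", "GENE_B": "protein_coding", "GENE_C": "pseudogene"}
grch38 = {
    "GENE_A": "protein_coding",
    "GENE_B": "protein_coding",
    "GENE_C": "pseudogene",
    "GENE_D": "protein_coding",
}
reclassified = discrepant_genes(grch37, grch38)
```

A variant prioritization tool that filters on the older biotype column would silently drop every gene in `reclassified`, which is the failure mode the abstract describes.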

Comparative genomics of muskmelon reveals a potential role for retrotransposons in the modification of gene expression

Communications Biology ◽

10.1038/s42003-020-01172-0 ◽

2020 ◽

Vol 3 (1) ◽

Author(s):

Ryoichi Yano ◽

Tohru Ariizumi ◽

Satoko Nonaka ◽

Yoichi Kawazu ◽

Silin Zhong ◽

...

Keyword(s):

Gene Expression ◽

Fruit Ripening ◽

Expression Patterns ◽

Rna Seq ◽

Protein Coding ◽

Genome Wide ◽

Oxford Nanopore ◽

Melon Genome ◽

Number Variation ◽

Genome Assemblies

Abstract Melon exhibits substantial natural variation especially in fruit ripening physiology, including both climacteric (ethylene-producing) and non-climacteric types. However, genomic mechanisms underlying such variation are not yet fully understood. Here, we report an Oxford Nanopore-based high-grade genome reference in the semi-climacteric cultivar Harukei-3 (378 Mb + 33,829 protein-coding genes), with an update of tissue-wide RNA-seq atlas in the Melonet-DB database. Comparison between Harukei-3 and DHL92, the first published melon genome, enabled identification of 24,758 one-to-one orthologue gene pairs, whereas others were candidates of copy number variation or presence/absence polymorphisms (PAPs). Further comparison based on 10 melon genome assemblies identified genome-wide PAPs of 415 retrotransposon Gag-like sequences. Of these, 160 showed fruit ripening-inducible expression, with 59.4% of the neighboring genes showing similar expression patterns (r > 0.8). Our results suggest that retrotransposons contributed to the modification of gene expression during diversification of melon genomes, and may affect fruit ripening-inducible gene expression.
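The r > 0.8 criterion can be illustrated with a small calculation: correlate the expression profile of a retrotransposon-like element with that of each neighboring gene and report the fraction above the cutoff. The sketch assumes a Pearson correlation over matched samples, which the abstract does not specify; the function names and toy profiles are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat profile has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def fraction_coexpressed(
    element: Sequence[float],
    neighbors: Sequence[Sequence[float]],
    cutoff: float = 0.8,
) -> float:
    """Share of neighboring genes whose profile correlates above `cutoff`."""
    if not neighbors:
        return 0.0
    hits = sum(1 for profile in neighbors if pearson_r(element, profile) > cutoff)
    return hits / len(neighbors)


# Hypothetical expression values across four ripening stages.
element_profile = [1.0, 2.0, 3.0, 4.0]
neighbor_profiles = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
share = fraction_coexpressed(element_profile, neighbor_profiles)
```

With real data the profiles would come from the RNA-seq atlas, and the choice of correlation measure and sample set would affect the reported percentage.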

Comparative analysis of de novo genomes reveals dynamic intra-species divergence of NLRs in pepper

BMC Plant Biology ◽

10.1186/s12870-021-03057-8 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Myung-Shin Kim ◽

Geun Young Chae ◽

Soohyun Oh ◽

Jihyun Kim ◽

Hyunggon Mang ◽

...

Keyword(s):

Capsicum Annuum ◽

De Novo ◽

Genomic Diversity ◽

Specific Gene ◽

Protein Coding ◽

Genomic Variations ◽

Gene Annotations ◽

Small Fruit ◽

Number Variation ◽

Genome Assemblies

Abstract Background: Peppers (Capsicum annuum L.) containing distinct capsaicinoids are the most widely cultivated spices in the world. However, extreme genomic diversity among species represents an obstacle to breeding pepper. Results: Here, we report de novo genome assemblies of Capsicum annuum ‘Early Calwonder (non-pungent, ECW)’ and ‘Small Fruit (pungent, SF)’ along with their annotations. In total, we assembled 2.9 Gb of ECW and SF genome sequences, representing over 91% of the estimated genome sizes. Structural and functional annotation of the two pepper genomes generated about 35,000 protein-coding genes each, of which 93% were assigned putative functions. Comparison between newly and publicly available pepper gene annotations revealed both shared and specific gene content. In addition, a comprehensive analysis of nucleotide-binding and leucine-rich repeat (NLR) genes through whole-genome alignment identified five significant regions of NLR copy number variation (CNV). Detailed comparisons of those regions revealed that these CNVs were generated by intra-specific genomic variations that accelerated diversification of NLRs among peppers. Conclusions: Our analyses unveil an evolutionary mechanism responsible for generating CNVs of NLRs among pepper accessions, and provide novel genomic resources for functional genomics and molecular breeding of disease resistance in Capsicum species.

High satellite repeat turnover in great apes studied with short- and long-read technologies

10.1101/470054 ◽

2018 ◽

Cited By ~ 1

Author(s):

Monika Cechova ◽

Robert S. Harris ◽

Marta Tomaszkiewicz ◽

Barbara Arbeithuber ◽

Francesca Chiaromonte ◽

...

Keyword(s):

Great Apes ◽

Great Ape ◽

Satellite Repeat ◽

Satellite Sequences ◽

Satellite Repeats ◽

Repetitive Nature ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Satellite repeats are a structural component of centromeres and telomeres, and in some instances their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: (1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and (2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males vs. females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.

Lost in translation: the pitfalls of Ensembl Gene annotations between human genome assemblies and their impact on diagnostics

10.1101/2020.11.12.380295 ◽

2020 ◽

Author(s):

Mohammed O.E Abdallah ◽

Mahmoud Koko ◽

Raj Ramesar

Keyword(s):

Human Genome ◽

Genome Assembly ◽

Evolutionary Constraint ◽

Clinical Genetics ◽

Ensembl Gene ◽

Protein Coding ◽

Gene Annotations ◽

Human Genome Assembly ◽

Gene Models ◽

Genome Assemblies

Abstract Background: The GRCh37 human genome assembly is still widely used in genomics despite the fact that an updated human genome assembly (GRCh38) has been available for many years. A particular issue with relevant ramifications for clinical genetics is the GRCh37 Ensembl gene annotations, which have been archived, and thus not updated, since 2013. These Ensembl GRCh37 gene annotations are just as ubiquitous as the former assembly and are the default gene models used and preferred by the majority of genomic projects internationally. In this study, we highlight the issue of genes with discrepant annotations that have been recognized as protein coding in the new but not the old assembly. These genes are ignored by all genomic resources that still rely on the archived and outdated gene annotations. Moreover, the majority, if not all, of these discrepant genes (DGs) are automatically discarded and ignored by all variant prioritization tools that rely on the GRCh37 Ensembl gene annotations. Methods: We performed bioinformatics analysis identifying Ensembl genes with discrepant annotations between the two most recent human genome assemblies, hg37 and hg38. Clinical and phenotype gene curations were obtained and compared for this gene set. Furthermore, matching RefSeq transcripts were also collated and analyzed. Results: We found hundreds of genes (N=267) that were reclassified as “protein-coding” in the new hg38 assembly. Notably, 169 of these genes also had a discrepant HGNC gene symbol between the two assemblies. Most genes had RefSeq matches (N=199/267), including all the genes with defined phenotypes in the Ensembl GRCh38 gene set (N=10). However, many protein-coding genes remain missing from the current known RefSeq gene models (N=68). Conclusion: We found many clinically relevant genes in this group of neglected genes and we anticipate that many more will be found relevant in the future. For these genes, the inaccurate label of “non-protein-coding” hinders the possibility of identifying any causal sequence variants that overlap them. In addition, important additional annotations such as evolutionary constraint metrics are also not calculated for these genes for the same reason, further relegating them into oblivion.

Chromosomal assembly of the nuclear genome of the endosymbiont-bearing trypanosomatid Angomonas deanei

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkaa018 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

John W Davey ◽

Carolina M C Catta-Preta ◽

Sally James ◽

Sarah Forrester ◽

Maria Cristina M Motta ◽

...

Keyword(s):

Chromosome Number ◽

Noncoding Rnas ◽

Nuclear Genome ◽

Supernumerary Chromosome ◽

Ribosomal Rnas ◽

Protein Coding ◽

Transfer Rnas ◽

Protein Coding Genes ◽

Oxford Nanopore ◽

Genome Assemblies

Abstract Angomonas deanei is an endosymbiont-bearing trypanosomatid with several highly fragmented genome assemblies and unknown chromosome number. We present an assembly of the A. deanei nuclear genome based on Oxford Nanopore sequence that resolves into 29 complete or close-to-complete chromosomes. The assembly has several previously unknown special features; it has a supernumerary chromosome, a chromosome with a 340-kb inversion, and there is a translocation between two chromosomes. We also present an updated annotation of the chromosomal genome with 10,365 protein-coding genes, 59 transfer RNAs, 26 ribosomal RNAs, and 62 noncoding RNAs.
