Large-scale analysis of SARS-CoV-2 spike-glycoprotein mutants demonstrates the need for continuous screening of virus isolates

Due to the wide spread of the COVID-19 pandemic, the SARS-CoV-2 genome is evolving in diverse human populations. Several studies have already reported different strains and an increase in the mutation rate. In particular, mutations in the SARS-CoV-2 spike glycoprotein are of great interest because it mediates infection in humans and the recently approved mRNA vaccines are designed to induce immune responses against it. We analyzed 1,036,030 SARS-CoV-2 genome assemblies and 30,806 NGS datasets from GISAID and the European Nucleotide Archive (ENA), focusing on non-synonymous mutations in the spike protein. Only around 2.5% of the samples contained the wild-type spike protein with no variation from the reference. Among the spike protein mutants, we confirmed a low mutation rate, with fewer than 10 non-synonymous mutations in 99.6% of the analyzed sequences, but the mean and median number of spike protein mutations per sample increased over time. In total, 5,472 distinct variants were found. The majority of the observed variants were recurrent, but only 21 and 14 recurrent variants were found in at least 1% of the mutant genome assemblies and NGS samples, respectively. Further, we found high-confidence subclonal variants in about 2.6% of the NGS data sets with mutant spike protein, which might indicate co-infection with various SARS-CoV-2 strains and/or intra-host evolution. Lastly, some variants might have an effect on antibody binding or T-cell recognition. These findings demonstrate the continuing importance of monitoring SARS-CoV-2 sequences for early detection of variants that require adaptations in preventive and therapeutic strategies.
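The per-sample statistics reported above (share of wild-type samples, number of substitutions per sample, recurrent variants) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes protein sequences that are already aligned to the reference, and the function name `spike_substitutions`, the toy reference string and the toy samples are hypothetical and carry no biological meaning.

```python
from collections import Counter
from typing import List


def spike_substitutions(reference: str, sample: str) -> List[str]:
    """List amino-acid substitutions between two pre-aligned protein sequences.

    Positions are numbered along the reference; gaps ("-") and ambiguous
    residues ("X") are skipped rather than reported.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to equal length")
    subs = []
    ref_pos = 0
    for ref_aa, alt_aa in zip(reference, sample):
        if ref_aa != "-":
            ref_pos += 1  # advance only on real reference residues
        if "-" in (ref_aa, alt_aa) or alt_aa == "X":
            continue
        if ref_aa != alt_aa:
            subs.append(f"{ref_aa}{ref_pos}{alt_aa}")
    return subs


# Hypothetical toy data, NOT real spike sequence.
reference = "MKTLLVAG"
samples = {"s1": "MKTLLVAG", "s2": "MKTFLVAG", "s3": "MKTFLVSG"}

variant_counts = Counter()
per_sample_counts = {}
for name, seq in samples.items():
    subs = spike_substitutions(reference, seq)
    per_sample_counts[name] = len(subs)
    variant_counts.update(subs)

# Share of samples identical to the reference ("wild type").
wild_type_fraction = sum(1 for n in per_sample_counts.values() if n == 0) / len(samples)
# Variants seen in at least two samples are counted as recurrent here.
recurrent = sorted(v for v, c in variant_counts.items() if c >= 2)

print(per_sample_counts, wild_type_fraction, recurrent)
```

A real analysis would additionally need alignment, translation of nucleotide assemblies, and quality filtering, none of which is shown here.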

Large-scale analysis of SARS-CoV-2 spike-glycoprotein mutants demonstrates the need for continuous screening of virus isolates

10.1101/2021.02.04.429765 ◽

2021 ◽

Author(s):

Barbara Schrörs ◽

Ranganath Gudimella ◽

Thomas Bukur ◽

Thomas Rösler ◽

Martin Löwer ◽

...

Keyword(s):

Mutation Rate ◽

Large Scale ◽

Median Number ◽

Spike Protein ◽

Human Populations ◽

Data Sets ◽

Spike Glycoprotein ◽

Synonymous Mutations ◽

Different Strains ◽

Genome Assemblies

Abstract Due to the wide spread of the COVID-19 pandemic, the SARS-CoV-2 genome is evolving in diverse human populations. Several studies have already reported different strains and an increase in the mutation rate. In particular, mutations in the SARS-CoV-2 spike glycoprotein are of great interest because it mediates infection in humans and the recently approved mRNA vaccines are designed to induce immune responses against it. We analyzed 146,920 SARS-CoV-2 genome assemblies and 2,393 NGS datasets from the GISAID, NCBI Virus and NCBI SRA archives, focusing on non-synonymous mutations in the spike protein. Only around 13.6% of the samples contained the wild-type spike protein with no variation from the reference. Among the spike protein mutants, we confirmed a low mutation rate, with fewer than 10 non-synonymous mutations in 99.98% of the analyzed sequences, but the mean and median number of spike protein mutations per sample increased over time. In total, 2,592 distinct variants were found. The majority of the observed variants were recurrent, but only nine and 23 recurrent variants were found in at least 0.5% of the mutant genome assemblies and NGS samples, respectively. Further, we found high-confidence subclonal variants in about 15.1% of the NGS data sets with mutant spike protein, which might indicate co-infection with various SARS-CoV-2 strains and/or intra-host evolution. Lastly, some variants might have an effect on antibody binding or T-cell recognition. These findings demonstrate the increasing importance of monitoring SARS-CoV-2 sequences for early detection of variants that require adaptations in preventive and therapeutic strategies.

The impact of recent population history on the deleterious mutation load in humans and close evolutionary relatives

10.1101/073635 ◽

2016 ◽

Author(s):

Yuval B. Simons ◽

Guy Sella

Keyword(s):

Demographic History ◽

Population History ◽

Human Populations ◽

Data Sets ◽

Loss Of Function ◽

Mutation Load ◽

Synonymous Mutations ◽

Show Evidence ◽

Out Of Africa ◽

The Impact

Abstract Over the past decade, there has been both great interest and confusion about whether recent demographic events, notably the Out-of-Africa bottleneck and recent population growth, have led to differences in mutation load among human populations. The confusion can be traced to the use of different summary statistics to measure load, which leads to apparently conflicting results. We argue, however, that when statistics more directly related to load are used, the results of different studies and data sets consistently reveal little or no difference in the load of non-synonymous mutations among human populations. Theory helps to understand why no such differences are seen, as well as to predict in what settings they are to be expected. In particular, as predicted by modeling, there is evidence for changes in the load of recessive loss-of-function mutations in founder and inbred human populations. Also as predicted, eastern subspecies of gorilla, Neanderthals and Denisovans, who are thought to have undergone reductions in population size that exceed the human Out-of-Africa bottleneck in duration and severity, show evidence for increased load of non-synonymous mutations (relative to western subspecies of gorilla and modern humans, respectively). A coherent picture is thus starting to emerge about the effects of demographic history on the mutation load in populations of humans and close evolutionary relatives.

Population analysis of retrotransposons in giraffe genomes supports RTE decline and widespread LINE1 activity in Giraffidae

Mobile DNA ◽

10.1186/s13100-021-00254-y ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Malte Petersen ◽

Sven Winter ◽

Raphael Coimbra ◽

Menno J. de Jong ◽

Vladimir V. Kapitonov ◽

...

Keyword(s):

Large Scale ◽

Population Analysis ◽

Model Organism ◽

Population Level ◽

Data Sets ◽

Ltr Retrotransposons ◽

Scale Population ◽

Mammalian Genomes ◽

Genome Assemblies ◽

First Time

Abstract Background: The majority of structural variation in genomes is caused by insertions of transposable elements (TEs). In mammalian genomes, the main TE fraction is made up of autonomous and non-autonomous non-LTR retrotransposons commonly known as LINEs and SINEs (Long and Short Interspersed Nuclear Elements). Here we present one of the first population-level analyses of TE insertions in a non-model organism, the giraffe. Giraffes are ruminant artiodactyls, one of the few mammalian groups with genomes that are colonized by putatively active LINEs of two different clades of non-LTR retrotransposons, namely the LINE1 and RTE/BovB LINEs, as well as their associated SINEs. We analyzed TE insertions of both types and their associated SINEs in three giraffe genome assemblies, as well as across a population-level sampling of 48 individuals covering all extant giraffe species. Results: The comparative genome screen identified 139,525 recent LINE1 and RTE insertions in the sampled giraffe population. The analysis revealed a drastically reduced RTE activity in giraffes, whereas LINE1 is still actively propagating in the genomes of extant (sub)-species. In concert with the extremely low activity of the giraffe RTE, we also found that RTE-dependent SINEs, namely Bov-tA and Bov-A2, have been virtually immobile in the last 2 million years. Despite the high current activity of the giraffe LINE1, we did not find evidence for the presence of currently active LINE1-dependent SINEs. TE insertion heterozygosity rates differ among the different (sub)-species, likely due to divergent population histories. Conclusions: The horizontally transferred RTE/BovB and its derived SINEs appear to be close to inactivation and subsequent extinction in the genomes of extant giraffe species. This is the first time that the decline of a TE family has been meticulously analyzed from a population genetics perspective. Our study shows how detailed information about past and present TE activity can be obtained by analyzing large-scale population-level genomic data sets.

Dating genomic variants and shared ancestry in population-scale sequencing data

10.1101/416610 ◽

2018 ◽

Cited By ~ 8

Author(s):

Patrick K. Albers ◽

Gil McVean

Keyword(s):

Large Scale ◽

Modern Human ◽

Genomic Diversity ◽

Human Populations ◽

Data Sets ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Geographical Regions ◽

Shared Ancestry ◽

Using Data

Abstract The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while much attention has been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a non-parametric approach for estimating the date of origin of genetic variants in large-scale sequencing data sets. The accuracy and robustness of the approach are demonstrated through simulation. Using data from two publicly available human genomic diversity resources, we estimated the age of more than 45 million single nucleotide polymorphisms (SNPs) in the human genome and release the Atlas of Variant Age as a public online database. We characterize the relationship between variant age and frequency in different geographical regions, and demonstrate the value of age information in interpreting variants of functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the ancestry shared between individual genomes, to quantify genealogical relationships at different points in the past, and to describe and explore the evolutionary history of modern human populations.

Exploring the genomic and proteomic variations of SARS-CoV-2 spike glycoprotein: a computational biology approach

10.1101/2020.04.07.030924 ◽

2020 ◽

Cited By ~ 5

Author(s):

Syed Mohammad Lokman ◽

Md. Rasheduzzaman ◽

Asma Salauddin ◽

Rocktim Barua ◽

Afsana Yeasmin Tanzina ◽

...

Keyword(s):

Phylogenetic Analyses ◽

Evolutionary Relationship ◽

Analysis Data ◽

Host Cells ◽

Spike Protein ◽

Spike Glycoprotein ◽

Multiple Sequence ◽

Synonymous Mutations ◽

Potential Inhibitors ◽

Bat Coronavirus

Abstract The newly identified SARS-CoV-2 has now been reported from around 183 countries, with more than a million confirmed human cases including more than 68,000 deaths. The genomes of SARS-CoV-2 strains isolated from different parts of the world are now available, and the unique features of the constituent genes and proteins have received substantial attention recently. The spike glycoprotein is widely considered a possible target to be explored because of its role during the entry of coronaviruses into host cells. We analyzed 320 whole-genome sequences and 320 spike protein sequences of SARS-CoV-2 using multiple sequence alignment tools. In this study, 483 unique variations were identified among the genomes, including 25 non-synonymous mutations and one deletion in the spike protein of SARS-CoV-2. Among the 26 variations detected, 12 were located in the N-terminal domain and 6 in the receptor-binding domain (RBD), which might alter the interaction with receptor molecules. In addition, 22 amino acid insertions were identified in the spike protein of SARS-CoV-2 in comparison with that of SARS-CoV. Phylogenetic analyses of the spike protein revealed that bat coronaviruses have a close evolutionary relationship with circulating SARS-CoV-2. The genetic variation analysis data presented in this study can support a better understanding of SARS-CoV-2 pathogenesis. Based on our findings, potential inhibitors targeting these proposed sites of variation can be designed and tested.

Faculty Opinions recommendation of Comparative assessment of large-scale data sets of protein-protein interactions.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1006598.82257 ◽

2002 ◽

Author(s):

Rob Russell

Keyword(s):

Protein Interactions ◽

Large Scale ◽

Comparative Assessment ◽

Data Sets ◽

Protein Protein Interactions ◽

Large Scale Data ◽

Scale Data ◽

Large Scale Data Sets

The Functional Consequences of the Novel Ribosomal Pausing Site in SARS-CoV-2 Spike Glycoprotein RNA

International Journal of Molecular Sciences ◽

10.3390/ijms22126490 ◽

2021 ◽

Vol 22 (12) ◽

pp. 6490

Author(s):

Olga A. Postnikova ◽

Sheetal Uppal ◽

Weiliang Huang ◽

Maureen A. Kane ◽

Rafael Villasmil ◽

...

Keyword(s):

Amino Acid ◽

Insertion Sequence ◽

Experimental Studies ◽

Spike Protein ◽

Spike Glycoprotein ◽

Furin Cleavage ◽

New Host ◽

S Protein ◽

Furin Cleavage Site ◽

Ribosome Pausing

The SARS-CoV-2 Spike glycoprotein (S protein) acquired a unique new 4 amino acid -PRRA- insertion sequence at amino acid residues (aa) 681–684 that forms a new furin cleavage site in S protein as well as several new glycosylation sites. We studied various statistical properties of the -PRRA- insertion at the RNA level (CCUCGGCGGGCA). The nucleotide composition and codon usage of this sequence are different from the rest of the SARS-CoV-2 genome. One such feature is the presence of two tandem CGG codons, although the CGG codon is the rarest codon in the SARS-CoV-2 genome. This suggests that the insertion sequence could cause ribosome pausing as a result of these rare codons. Due to population variants, the Nextstrain divergence measure of the CCU codon is extremely large. We cannot exclude that this divergence might affect host immune responses or the effectiveness of SARS-CoV-2 vaccines, possibilities that await further investigation. Our experimental studies show that the expression level of the original RNA sequence “wildtype” spike protein is much lower than that of codon-optimized spike protein in all studied cell lines. Interestingly, the original spike sequence produces a higher titer of pseudoviral particles and a higher level of infection. Further mutagenesis experiments suggest that this dual-effect insert, composed of a combination of overlapping translation pausing and furin sites, has allowed SARS-CoV-2 to infect its new host (human) more readily. This underlines the importance of ribosome pausing in allowing efficient regulation of protein expression and of cotranslational subdomain folding.
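The codon-level reading of the insertion sequence quoted above can be checked with a short Python sketch. It only splits the 12-nucleotide RNA string from the abstract into codons, translates it with the three standard-genetic-code entries it needs, and looks for adjacent CGG codons; the helper name `split_codons` is hypothetical, and nothing here reproduces the paper's genome-wide codon-usage statistics.

```python
from typing import List

# Standard genetic code, restricted to the three codons that occur here.
CODON_TABLE = {"CCU": "P", "CGG": "R", "GCA": "A"}

# RNA sequence of the insertion as given in the abstract.
insert_rna = "CCUCGGCGGGCA"


def split_codons(rna: str) -> List[str]:
    """Split an in-frame RNA string into consecutive 3-nucleotide codons."""
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


codons = split_codons(insert_rna)
peptide = "".join(CODON_TABLE[c] for c in codons)
cgg_count = codons.count("CGG")
# True when two CGG codons sit directly next to each other.
tandem_cgg = any(a == b == "CGG" for a, b in zip(codons, codons[1:]))

print(codons, peptide, cgg_count, tandem_cgg)
```

Splitting in frame matters: the same string read one nucleotide later would not show the two CGG codons side by side.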

Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim 8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim 2.8\sigma$ and $\sim 7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.

Accelerating In-Transit Co-Processing for Scientific Simulations Using Region-Based Data-Driven Analysis

Algorithms ◽

10.3390/a14050154 ◽

2021 ◽

Vol 14 (5) ◽

pp. 154

Author(s):

Marcus Walldén ◽

Masao Okita ◽

Fumihiko Ino ◽

Dimitris Drikakis ◽

Ioannis Kokkinakis

Keyword(s):

Large Scale ◽

Data Driven ◽

Data Sets ◽

Output Constraints ◽

Data Driven Approach ◽

Scientific Simulations ◽

Multiple Metrics ◽

In Transit ◽

Multiple Compression ◽

Large Scale Simulations

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly.

The Age of Nonsynonymous and Synonymous Mutations in Animal mtDNA and Implications for the Mildly Deleterious Theory

Genetics ◽

10.1093/genetics/153.1.497 ◽

1999 ◽

Vol 153 (1) ◽

pp. 497-506 ◽

Cited By ~ 4

Author(s):

Rasmus Nielsen ◽

Daniel M Weinreich

Keyword(s):

Dna Sequences ◽

Purifying Selection ◽

Data Sets ◽

Deleterious Mutations ◽

Synonymous Mutations ◽

Weak Evidence ◽

Mitochondrial Data ◽

The Mean ◽

Excess Number ◽

Neutral Mutations

Abstract McDonald/Kreitman tests performed on animal mtDNA consistently reveal significant deviations from strict neutrality in the direction of an excess number of polymorphic nonsynonymous sites, which is consistent with purifying selection acting on nonsynonymous sites. We show that under models of recurrent neutral and deleterious mutations, the mean age of segregating neutral mutations is greater than the mean age of segregating selected mutations, even in the absence of recombination. We develop a test of the hypothesis that the mean age of segregating synonymous mutations equals the mean age of segregating nonsynonymous mutations in a sample of DNA sequences. The power of this age-of-mutation test and the power of the McDonald/Kreitman test are explored by computer simulations. We apply the new test to 25 previously published mitochondrial data sets and find weak evidence for selection against nonsynonymous mutations.
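For orientation, the McDonald/Kreitman comparison mentioned above rests on a 2×2 table of polymorphic versus fixed differences at nonsynonymous versus synonymous sites. The Python sketch below computes the commonly used neutrality index, NI = (Pn/Ps)/(Dn/Ds), and a two-sided Fisher exact p-value for such a table. It is not the age-of-mutation test developed in the paper, the function names are hypothetical, and the counts are invented for illustration only.

```python
from math import comb


def neutrality_index(pn: int, ps: int, dn: int, ds: int) -> float:
    """NI = (Pn/Ps) / (Dn/Ds); values above 1 indicate an excess of
    nonsynonymous polymorphism relative to nonsynonymous divergence."""
    return (pn / ps) / (dn / ds)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(x: int) -> float:
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Small tolerance so that ties with the observed table are included.
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts, not taken from any of the 25 data sets in the paper.
pn, ps = 10, 20   # polymorphic nonsynonymous / synonymous sites
dn, ds = 5, 40    # fixed nonsynonymous / synonymous differences

ni = neutrality_index(pn, ps, dn, ds)
p_value = fisher_exact_two_sided(pn, ps, dn, ds)
print(f"NI = {ni:.2f}, Fisher exact p = {p_value:.4f}")
```

With these invented counts NI is 4, the direction the abstract describes as an excess of polymorphic nonsynonymous sites.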
