Mutation rates and selection on synonymous mutations in SARS-CoV-2

Abstract The COVID-19 pandemic has seen an unprecedented response from the sequencing community. Leveraging the sequence data from more than 140,000 SARS-CoV-2 genomes, we study mutation rates and selective pressures affecting the virus. Understanding the processes and effects of mutation and selection has profound implications for the study of viral evolution, for vaccine design, and for the tracking of viral spread. We highlight and address some common genome sequence analysis pitfalls that can lead to inaccurate inference of mutation rates and selection, such as ignoring skews in the genetic code, not accounting for recurrent mutations, and assuming evolutionary equilibrium. We find that two particular mutation rates, G→U and C→U, are similarly elevated and considerably higher than all other mutation rates, causing the majority of mutations in the SARS-CoV-2 genome, and are possibly the result of APOBEC and ROS activity. These mutations also tend to occur many times at the same genome positions along the global SARS-CoV-2 phylogeny (i.e., they are very homoplasic). We observe an effect of genomic context on mutation rates, but the effect of the context is overall limited. While previous studies have suggested selection acting to decrease U content at synonymous sites, we bring forward evidence suggesting the opposite.

Functional alterations caused by mutations reflect evolutionary trends of SARS-CoV-2

Briefings in Bioinformatics ◽

10.1093/bib/bbab042 ◽

2021 ◽

Author(s):

Liang Cheng ◽

Xudong Han ◽

Zijun Zhu ◽

Changlu Qi ◽

Ping Wang ◽

...

Keyword(s):

Reference Genome ◽

Sequence Data ◽

Purifying Selection ◽

Virus Genome ◽

Receptor Binding Domain ◽

Evolutionary Trends ◽

Synonymous Mutations ◽

Almost All ◽

Virus Strains ◽

New Mutations

Abstract Since the first report of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in December 2019, the COVID-19 pandemic has spread rapidly worldwide. Because few virus strains were available, early studies observed few of the key mutations that matter for the evolutionary trends of the virus genome. Here, we downloaded 1809 sequences of SARS-CoV-2 strains submitted to GISAID before April 2020 to identify mutations and the functional alterations they cause. In total, we identified 1017 nonsynonymous and 512 synonymous mutations by alignment to the reference genome NC_045512, none of which were observed in the receptor-binding domain (RBD) of the spike protein. On average, each strain could acquire about 1.75 new mutations per month. The current mutations may have little impact on antibodies. Although the whole genome shows purifying selection, ORF3a, ORF8 and ORF10 were under positive selection. Only the 36 mutations that occurred in 1% or more of the virus strains were further analyzed to reveal linkage disequilibrium (LD) variants and dominant mutations. As a result, we observed five dominant mutations: three nonsynonymous mutations (C28144T, C14408T and A23403G) and two synonymous mutations (T8782C and C3037T). These five mutations occurred in almost all strains in April 2020. In addition, we observed two potential dominant nonsynonymous mutations, C1059T and G25563T, which occurred in most of the strains in April 2020. Further functional analysis shows that these mutations largely decreased protein stability, which could lead to a significant reduction of virus virulence. Moreover, the A23403G mutation increases the spike-ACE2 interaction and thereby enhances infectivity. All of these findings proved that the evolution of SARS-CoV-2 is toward the enhancement of infectivity and reduction of virulence.
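
The mutation labels in this abstract follow the common reference-base, position, alternate-base notation (for example, A23403G). As a minimal sketch, assuming that notation and DNA-style bases, the Python below parses the seven labels named above and classifies each as a transition or a transversion. The names `Substitution`, `parse_substitution` and `is_transition` are hypothetical helpers introduced here for illustration; they are not part of the authors' pipeline.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    """A single-nucleotide substitution such as A23403G (hypothetical helper type)."""
    ref: str  # reference base
    pos: int  # 1-based genome position
    alt: str  # alternate base


# Reference base, digits for the position, alternate base.
_LABEL = re.compile(r"^([ACGT])(\d+)([ACGT])$")
_PURINES = {"A", "G"}


def parse_substitution(label: str) -> Substitution:
    """Parse a label like 'C3037T' into its reference base, position and alternate base."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return Substitution(match.group(1), int(match.group(2)), match.group(3))


def is_transition(sub: Substitution) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes (A<->G, C<->T)."""
    return sub.ref != sub.alt and (sub.ref in _PURINES) == (sub.alt in _PURINES)


# The five dominant and two potential dominant mutations named in the abstract.
labels = ["C28144T", "C14408T", "A23403G", "T8782C", "C3037T", "C1059T", "G25563T"]
for label in labels:
    sub = parse_substitution(label)
    kind = "transition" if is_transition(sub) else "transversion"
    print(f"{label}: {sub.ref}->{sub.alt} at position {sub.pos} ({kind})")
```

The sketch only inspects the nucleotide change. Deciding whether a mutation is synonymous or nonsynonymous would additionally require the reading frame and annotation of NC_045512, which is outside what this example attempts.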

Learning the language of viral evolution and escape

Science ◽

10.1126/science.abd7331 ◽

2021 ◽

Vol 371 (6526) ◽

pp. 284-288 ◽

Cited By ~ 4

Author(s):

Brian Hie ◽

Ellen D. Zhong ◽

Bonnie Berger ◽

Bryan Bryson

Keyword(s):

Immune System ◽

Natural Language ◽

Vaccine Development ◽

Sequence Data ◽

Viral Evolution ◽

Machine Learning Algorithms ◽

Language Models ◽

Viral Escape ◽

Human Immune System ◽

Influenza Hemagglutinin

The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence’s grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.

Worldwide tracing of mutations and the evolutionary dynamics of SARS-CoV-2

10.1101/2020.08.07.242263 ◽

2020 ◽

Author(s):

Zhong-Yin Zhou ◽

Hang Liu ◽

Yue-Dong Zhang ◽

Yin-Qiao Wu ◽

Min-Sheng Peng ◽

...

Keyword(s):

Substitution Rate ◽

Evolutionary Dynamics ◽

Vaccine Development ◽

Sequence Data ◽

Immune Recognition ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Protein Coding ◽

Theoretical Support ◽

Recurrent Mutations

Abstract Understanding the mutational and evolutionary dynamics of SARS-CoV-2 is essential for treating COVID-19 and for the development of a vaccine. Here, we analyzed 15,818 publicly available assembled SARS-CoV-2 genome sequences, along with 2,350 raw sequence datasets sampled worldwide. We investigated the distribution of inter-host single nucleotide polymorphisms (inter-host SNPs) and intra-host single nucleotide variations (iSNVs). Mutations have been observed at 35.6% (10,649/29,903) of the bases in the genome. The substitution rate in some protein coding regions is higher than the average in SARS-CoV-2 viruses, and the high substitution rate in some regions might be driven by diversifying selection to escape immune recognition. Both recurrent mutations and human-to-human transmission are mechanisms that generate fitness advantageous mutations. Furthermore, the frequency of three mutations (S protein, F400L; ORF3a protein, T164I; and ORF1a protein, Q6383H) has gradually increased over time on lineages, which provides new clues for the early detection of fitness advantageous mutations. Our study provides theoretical support for vaccine development and the optimization of treatment for COVID-19. We call on researchers to submit raw sequence data to public databases.
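
The 35.6% figure in this abstract is the number of mutated positions divided by the genome length. A short Python check, using only the two counts quoted above (the variable names are illustrative assumptions), reproduces it:

```python
# Counts quoted in the abstract: mutated positions and total bases in the genome.
GENOME_LENGTH = 29_903
mutated_sites = 10_649

# Fraction of genome positions at which at least one mutation was observed.
fraction = mutated_sites / GENOME_LENGTH
print(f"{fraction:.1%}")  # prints "35.6%"
```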

Genomic epidemiology reveals multiple introductions of SARS-CoV-2 followed by community and nosocomial spread, Germany, February to May 2020

Eurosurveillance ◽

10.2807/1560-7917.es.2021.26.43.2002066 ◽

2021 ◽

Vol 26 (43) ◽

Author(s):

Maximilian Muenchhoff ◽

Alexander Graf ◽

Stefan Krebs ◽

Caroline Quartucci ◽

Sandra Hasmann ◽

...

Keyword(s):

Healthcare Workers ◽

Sequence Data ◽

Phylogenetic Analyses ◽

Local Level ◽

University Hospital ◽

Metropolitan Region ◽

Viral Genomes ◽

Genomic Epidemiology ◽

Viral Spread ◽

Spatio Temporal

Background In the SARS-CoV-2 pandemic, viral genomes are available at unprecedented speed, but spatio-temporal bias in genome sequence sampling precludes phylogeographical inference without additional contextual data. Aim We applied genomic epidemiology to trace SARS-CoV-2 spread on an international, national and local level, to illustrate how transmission chains can be resolved to the level of a single event and single person using integrated sequence data and spatio-temporal metadata. Methods We investigated 289 COVID-19 cases at a university hospital in Munich, Germany, between 29 February and 27 May 2020. Using the ARTIC protocol, we obtained near full-length viral genomes from 174 SARS-CoV-2-positive respiratory samples. Phylogenetic analyses using the Auspice software were employed in combination with anamnestic reporting of travel history, interpersonal interactions and perceived high-risk exposures among patients and healthcare workers to characterise cluster outbreaks and establish likely scenarios and timelines of transmission. Results We identified multiple independent introductions in the Munich Metropolitan Region during the first weeks of the first pandemic wave, mainly by travellers returning from popular skiing areas in the Alps. In these early weeks, the rate of presumable hospital-acquired infections among patients and in particular healthcare workers was high (9.6% and 54%, respectively) and we illustrated how transmission chains can be dissected at high resolution combining virus sequences and spatio-temporal networks of human interactions. Conclusions Early spread of SARS-CoV-2 in Europe was catalysed by superspreading events and regional hotspots during the winter holiday season. Genomic epidemiology can be employed to trace viral spread and inform effective containment strategies.

Stability of SARS-CoV-2 Phylogenies

10.1101/2020.06.08.141127 ◽

2020 ◽

Cited By ~ 3

Author(s):

Yatish Turakhia ◽

Bryan Thornlow ◽

Landen Gozashti ◽

Angie S. Hinrichs ◽

Jason D. Fernandes ◽

...

Keyword(s):

Binding Sites ◽

Sequence Data ◽

Scientific Discovery ◽

Lineage Tracing ◽

Protein Coding ◽

Sequencing Errors ◽

Scientific Inference ◽

Recurrent Mutations ◽

Sequence Quality ◽

Essential Sequence

Abstract The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation and/or recombination among viral lineages. We suggest how samples can be screened and problematic mutations removed. We also develop tools for comparing and visualizing differences among phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes.
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.
Foreword We wish to thank all groups that responded rapidly by producing these invaluable and essential sequence data. Their contributions have enabled an unprecedented, lightning-fast process of scientific discovery, truly an incredible benefit for humanity and for the scientific community. We emphasize that most lab groups with whom we associate specific suspicious alleles are also those who have produced the most sequence data at a time when it was urgently needed. We commend their efforts. We have already contacted each group and many have updated their sequences. Our goal with this work is not to highlight potential errors, but to understand the impacts of these and other kinds of highly recurrent mutations so as to identify commonalities among the suspicious examples that can improve sequence quality and analysis going forward.

Paleovirology: inferring viral evolution from host genome sequence data

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2012.0493 ◽

2013 ◽

Vol 368 (1626) ◽

pp. 20120493 ◽

Cited By ~ 36

Author(s):

Aris Katzourakis

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Viral Evolution ◽

Host Genome ◽

Genome Sequence Data

DeepCLIP: predicting the effect of mutations on protein–RNA binding with deep learning

Nucleic Acids Research ◽

10.1093/nar/gkaa530 ◽

2020 ◽

Author(s):

Alexander Gulliver Bjørnholt Grønning ◽

Thomas Koed Doktor ◽

Simon Jonas Larsen ◽

Ulrika Simone Spangsberg Petersen ◽

Lise Lolle Holm ◽

...

Keyword(s):

Protein Binding ◽

Rna Binding ◽

Rna Binding Proteins ◽

Sequence Data ◽

Genomic Context ◽

Functional Changes ◽

Lab Experiments ◽

Sequence Variations ◽

Wet Lab ◽

Nuclear Shuttling

Abstract Nucleotide variants can cause functional changes by altering protein–RNA binding in various ways that are not easy to predict. This can affect processes such as splicing, nuclear shuttling, and stability of the transcript. Therefore, correct modeling of protein–RNA binding is critical when predicting the effects of sequence variations. Many RNA-binding proteins recognize a diverse set of motifs and binding is typically also dependent on the genomic context, making this task particularly challenging. Here, we present DeepCLIP, the first method for context-aware modeling and predicting protein binding to RNA nucleic acids using exclusively sequence data as input. We show that DeepCLIP outperforms existing methods for modeling RNA-protein binding. Importantly, we demonstrate that DeepCLIP predictions correlate with the functional outcomes of nucleotide variants in independent wet lab experiments. Furthermore, we show how DeepCLIP binding profiles can be used in the design of therapeutically relevant antisense oligonucleotides, and to uncover possible position-dependent regulation in a tissue-specific manner. DeepCLIP is freely available as a stand-alone application and as a webtool at http://deepclip.compbio.sdu.dk.

Whole-exome sequencing (WES) of penile squamous cell carcinoma (PSCC) to identify multiple recurrent mutations.

Journal of Clinical Oncology ◽

10.1200/jco.2016.34.2_suppl.484 ◽

2016 ◽

Vol 34 (2_suppl) ◽

pp. 484-484 ◽

Cited By ~ 1

Author(s):

Gurudatta Naik ◽

Dongquan Chen ◽

Michael Crowley ◽

David Crossman ◽

Katherine C. Sexton ◽

...

Keyword(s):

Normal Tissue ◽

Reference Genome ◽

Sequence Data ◽

The Cancer Genome Atlas ◽

Exome Capture ◽

Missense Mutations ◽

Adjacent Normal Tissue ◽

Human Reference Genome ◽

Whole Exome ◽

Recurrent Mutations

484 Background: Molecular alterations and drivers of PSCC, an orphan malignancy, remain unclear. The Cancer Genome Atlas is not studying PSCC and the Catalogue of Somatic Mutations in Cancer has performed targeted analyses only. We report WES of PSCC tumors from a group of patients (pts). Methods: Fresh-frozen macrodissected PSCC tumor tissue and adjacent normal tissue samples were procured from the Cooperative Human Tissue Network. DNA was isolated from tissue sections by phenol chloroform extraction. Exome capture was performed with the Agilent SureSelect clinical research exome kit and whole exome-seq was done on the Illumina HiSeq2500 with paired end 100bp chemistry. Raw sequence data in Fastq format were aligned to the human reference genome, quantified, and compared using a local instance of Galaxy (galaxy.uabgrid.uab.edu). These data were analyzed for mutations (SNPs) with Partek Genomic Suite/Flow (PGS, Partek, St. Louis, MO) for variant calling against the human reference genome (hg19) as referenced to dbSNP, and for copy number variants (cnv) with the FishingCNV tool together with picard tools/samtools/GATK. We focused on missense mutations and amplifications found among ≥ 2 tumor samples but not in normal samples, as they may cause upregulation of gene/protein function, which may be therapeutically actionable. Results: PSCC tumors were available from 11 patients and adjacent normal tissue from 3 patients. The 10 most common genes with > 4 missense mutations among ≥ 2 tumor samples overall were the following in decreasing order of frequency: MUC4, HLA-DPA1, MUC16, XIRP2, SSPO, TTN, FCGBP, PABPC3, ALPK2 and MKI67. The top upstream transcriptional regulators were PIH1D3, PRDM5, PTK2, Coup-Tf and NBEAL2. When examining candidate actionable genes, recurrent missense alterations were seen in PIK3C2A and PIK3C2G. Additional analysis will study alterations in functional domains and cnv.
Conclusions: WES identified a relatively high mutation burden in PSCC with recurrent missense mutations in multiple genes, notably including the PI3K gene among potentially actionable genes. Validation of these findings and further study of downstream effects are required.

TRANSCRIPT ANNOTATION PRIORITIZATION AND SCREENING SYSTEM (TrAPSS) FOR MUTATION SCREENING

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720007003132 ◽

2007 ◽

Vol 05 (06) ◽

pp. 1155-1172 ◽

Cited By ~ 1

Author(s):

BRIAN M. O'LEARY ◽

STEVEN G. DAVIS ◽

MICHAEL F. SMITH ◽

BARTLEY BROWN ◽

MATHEW B. KEMP ◽

...

Keyword(s):

Candidate Genes ◽

Information Management ◽

Sequence Data ◽

Mutation Screening ◽

Disease Genes ◽

Screening System ◽

Genomic Context ◽

Significant Information ◽

Algorithmic Technique ◽

Protein Functional Domains

When searching for disease-causing mutations with polymerase chain reaction (PCR)-based methods, candidate genes are usually screened in their entirety, exon by exon. Genomic resources (i.e. www.ncbi.nih.gov, www.ensembl.org, and genome.ucsc.edu) largely support this paradigm for mutation screening by making it easy to view and access sequence data associated with genes in their genomic context. However, the administrative burden of conducting mutation screening in potentially hundreds of genes and thousands of exons in thousands of patients is significant, even with the use of public genome resources. For example, the manual design of oligonucleotide primers for all exons of the 10 Leber's congenital amaurosis (LCA) genes (149 exons) represents a significant information management challenge. The Transcript Annotation Prioritization and Screening System (TrAPSS) is designed to accelerate mutation screening by (1) providing a gene-based local cache of candidate disease genes in a genomic context, (2) automating tasks associated with optimizing candidate disease gene screening and information management, and (3) providing the implementation of an algorithmic technique to utilize large amounts of heterogeneous genome annotation (e.g. conserved protein functional domains) so as to prioritize candidate genes.

Rapid longitudinal SARS-CoV-2 intra-host emergence of novel haplotypes regardless of immune deficiencies

10.1101/2021.12.22.473949 ◽

2021 ◽

Author(s):

Laura Manuto ◽

Marco Grazioli ◽

Andrea Spitaleri ◽

Paolo Fontana ◽

Luca Bianco ◽

...

Keyword(s):

Viral Evolution ◽

Viral Genomes ◽

Network Analyses ◽

Viral Spread ◽

Minimum Spanning Network ◽

Immune Deficiencies ◽

Chinese Tourists ◽

Containment Strategy ◽

First Time ◽

Viral Sequences

In February 2020, the municipality of Vo’, a small town near Padua (Italy), was quarantined due to the first coronavirus disease 19 (COVID-19)-related death detected in Italy. The entire population was swab tested in two sequential surveys. Here we report the analysis of the viral genomes, which revealed that the unique ancestor haplotype introduced in Vo’ belongs to lineage B and, more specifically, to the subtype found at the end of January 2020 in two Chinese tourists visiting Rome and other Italian cities, carrying mutations G11083T and G26144T. The sequences, obtained for 87 samples, allowed us to investigate viral evolution during transmission within and across households and the effectiveness of the non-pharmaceutical interventions implemented in Vo’. We report, for the first time, evidence that novel viral haplotypes can naturally arise intra-host within an interval as short as two weeks, in approximately 30% of the infected individuals, regardless of symptom severity or immune system deficiencies. Moreover, both phylogenetic and minimum spanning network analyses converge on the hypothesis that the viral sequences evolved from a unique common ancestor haplotype, carried by an index case. The lockdown extinguished both viral spread and the emergence of new variants, confirming the efficiency of this containment strategy. The information gathered from households was used to reconstruct possible transmission events.
