Prediction of evolutionary constraint by genomic annotations improves prioritization of causal variants in maize

Mapping Intimacies ◽

10.1101/2021.09.03.458856 ◽

2021 ◽

Author(s):

Guillaume P. Ramstein ◽

Edward S. Buckler

Keyword(s):

Genomic Prediction ◽

Crop Improvement ◽

GC Content ◽

Single Site ◽

Evolutionary Constraint ◽

Sequence Alignments ◽

Multiple Sequence ◽

Base Editing ◽

Synonymous Mutations ◽

Causal Variants

Abstract: Crop improvement through cross-population genomic prediction and genome editing requires identification of causal variants at single-site resolution. Most genetic mapping studies have lacked such resolution. In contrast, evolutionary approaches can detect genetic effects at high resolution, but they are limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Here we used genomic annotations to accurately predict nucleotide conservation across Angiosperms, as a proxy for fitness effect of mutations. Using only sequence analysis, we annotated non-synonymous mutations in 25,824 maize gene models, with information from bioinformatics (SIFT scores, GC content, transposon insertion, k-mer frequency) and deep learning (predicted effects of polymorphisms on protein representations by UniRep). Our predictions were validated by experimental information: within-species conservation, chromatin accessibility, gene expression and gene ontology enrichment. Importantly, they also improved genomic prediction for fitness-related traits (grain yield) in elite maize panels (+5% and +38% prediction accuracy within and across panels, respectively), by stringent prioritization of ≤ 1% of single-site variants (e.g., 104 sites and approximately 15 deleterious alleles per haploid genome). Together, our results suggest that our proposed approach may effectively prioritize sites most likely to impact fitness-related traits in crops. Such prioritizations could be useful to select polymorphisms for accurate genomic prediction, and candidate mutations for efficient base editing.
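
Two of the sequence-only annotations named in the abstract, GC content and k-mer frequency, can be illustrated with a minimal Python sketch. The function names, the handling of ambiguous bases, and the example window are assumptions made for illustration; the abstract does not specify how the authors computed these features.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (assumed convention: skip N and gaps)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in VALID_BASES)
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each k-mer made only of A/C/G/T in the sequence."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(m for m in kmers if set(m) <= VALID_BASES)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


# Hypothetical window of sequence around a candidate site.
window = "ATGCGCNNATGC"
print(gc_content(window))
print(kmer_frequencies(window, 3))
```

Such per-site or per-window features would then be inputs to a predictor of conservation; that modelling step is not sketched here.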


Faculty Opinions recommendation of Evolutionary profiles from the QR factorization of multiple sequence alignments.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1024515.296730 ◽

2005 ◽

Author(s):

Anne-Catherine Dock-Bregeon

Keyword(s):

QR Factorization ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments


Faculty Opinions recommendation of Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.732011981.793542976 ◽

2018 ◽

Author(s):

Chandra Verma ◽

Suryani Lukman

Keyword(s):

Machine Learning ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Multiple Sequence Alignments


Positive natural selection in primate genes of the type I interferon response

BMC Ecology and Evolution ◽

10.1186/s12862-021-01783-z ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Elena N. Judd ◽

Alison R. Gilchrist ◽

Nicholas R. Meyerson ◽

Sara L. Sawyer

Keyword(s):

Natural Selection ◽

Positive Selection ◽

Type I Interferon ◽

Interferon Response ◽

Type I ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Interferon Stimulated Genes ◽

Interferon Induction

Abstract: Background: The Type I interferon response is an important first-line defense against viruses. In turn, viruses antagonize (i.e., degrade, mis-localize, etc.) many proteins in interferon pathways. Thus, hosts and viruses are locked in an evolutionary arms race for dominance of the Type I interferon pathway. As a result, many genes in interferon pathways have experienced positive natural selection in favor of new allelic forms that can better recognize viruses or escape viral antagonists. Here, we performed a holistic analysis of selective pressures acting on genes in the Type I interferon family. We initially hypothesized that the genes responsible for inducing the production of interferon would be antagonized more heavily by viruses than genes that are turned on as a result of interferon. Our logic was that viruses would have greater effect if they worked upstream of the production of interferon molecules because, once interferon is produced, hundreds of interferon-stimulated proteins would activate and the virus would need to counteract them one-by-one. Results: We curated multiple sequence alignments of primate orthologs for 131 genes active in interferon production and signaling (herein, “induction” genes), 100 interferon-stimulated genes, and 100 randomly chosen genes. We analyzed each multiple sequence alignment for the signatures of recurrent positive selection. Counter to our hypothesis, we found the interferon-stimulated genes, and not interferon induction genes, are evolving significantly more rapidly than a random set of genes. Interferon induction genes evolve in a way that is indistinguishable from a matched set of random genes (22% and 18% of genes bear signatures of positive selection, respectively). In contrast, interferon-stimulated genes evolve differently, with 33% of genes evolving under positive selection and containing a significantly higher fraction of codons that have experienced selection for recurrent replacement of the encoded amino acid. Conclusion: Viruses may antagonize individual products of the interferon response more often than trying to neutralize the system altogether.


Insight into the bZIP Gene Family in Solanum tuberosum: Genome and Transcriptome Analysis to Understand the Roles of Gene Diversification in Spatiotemporal Gene Expression and Function

International Journal of Molecular Sciences ◽

10.3390/ijms22010253 ◽

2020 ◽

Vol 22 (1) ◽

pp. 253

Author(s):

Venura Herath ◽

Jeanmarie Verchot

Keyword(s):

Gene Expression ◽

Gene Family ◽

Functional Groups ◽

Leucine Zipper ◽

Expression Profiles ◽

Crop Improvement ◽

Potato Virus X ◽

Tissue Growth ◽

Abiotic Stress Response ◽

Sequence Alignments

The basic region-leucine zipper (bZIP) transcription factors (TFs) form homodimers and heterodimers via the coiled-coil region. The bZIP dimerization network influences gene expression across plant development and in response to a range of environmental stresses. The recent release of the most comprehensive potato reference genome was used to identify 80 StbZIP genes and to characterize their gene structure, phylogenetic relationships, and gene expression profiles. The StbZIP genes have undergone 22 segmental duplication events and one tandem duplication event. Ka/Ks analysis suggested that most duplications experienced purifying selection. Amino acid sequence alignments and phylogenetic comparisons made with the Arabidopsis bZIP family were used to assign the StbZIP genes to functional groups based on the Arabidopsis orthologs. The patterns of introns and exons were conserved within the assigned functional groups, which supports the phylogeny and is evidence of a common progenitor. Inspection of the leucine repeat heptads within the bZIP domains identified a pattern of attractive pairs favoring homodimerization, and repulsive pairs favoring heterodimerization. These patterns of attractive and repulsive heptads were similar within each functional group for Arabidopsis and S. tuberosum orthologs. High-throughput RNA-seq data indicated the most highly expressed and repressed genes that might play significant roles in tissue growth and development, abiotic stress response, and response to pathogens including Potato virus X. These data provide useful information for further functional analysis of the StbZIP gene family and their potential applications in crop improvement.


Respiratory syncytial virus B sequence analysis reveals a novel early genotype

Scientific Reports ◽

10.1038/s41598-021-83079-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Juan C. Muñoz-Escalante ◽

Andreu Comas-García ◽

Sofía Bernal-Silva ◽

Daniel E. Noyola

Keyword(s):

Molecular Markers ◽

Respiratory Syncytial Virus ◽

Maximum Likelihood ◽

Respiratory Infections ◽

Phylogenetic Analyses ◽

Detection Methods ◽

Sequence Alignments ◽

Multiple Sequence ◽

Individual Gene ◽

Syncytial Virus

Abstract: Respiratory syncytial virus (RSV) is a major cause of respiratory infections and is classified into two main groups, RSV-A and RSV-B, with multiple genotypes within each of them. For RSV-B, more than 30 genotypes have been described, without consensus on their definition. The lack of genotype assignation criteria has a direct impact on the understanding of viral evolution, the development of viral detection methods, and vaccine design. Here we analyzed all complete RSV-B G gene ectodomain sequences published in GenBank until September 2018 (n = 2190), including 478 complete genome sequences, using maximum likelihood and Bayesian phylogenetic analyses, as well as intergenotypic and intragenotypic distance matrices, in order to generate a systematic genotype assignation. Individual RSV-B genes were also assessed using maximum likelihood phylogenetic analyses, and multiple sequence alignments were used to identify molecular markers associated with specific genotypes. Analyses of the complete G gene ectodomain region, sequence clustering patterns, and the presence of molecular markers of each individual gene indicate that the 37 previously described genotypes can be classified into fifteen distinct genotypes: BA, BA-C, BA-CC, CB1-THB, GB1-GB4, GB6, JAB1-NZB2, SAB1, SAB2, SAB4, URU2 and a novel early circulating genotype characterized in the present study and designated GB0.
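
The comparison of intragenotypic and intergenotypic distances mentioned in the abstract can be sketched as follows. This is an illustration only: it assumes a simple p-distance (proportion of differing aligned sites, skipping gaps and N), which may differ from the distance model the authors used, and the sequences and genotype labels are made up.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences (gaps and N skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


def mean_intra_inter(seqs: dict[str, str], genotype: dict[str, str]) -> tuple[float, float]:
    """Mean pairwise distance within genotypes and between genotypes."""
    intra, inter = [], []
    for s, t in combinations(sorted(seqs), 2):
        d = p_distance(seqs[s], seqs[t])
        (intra if genotype[s] == genotype[t] else inter).append(d)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


# Hypothetical toy alignment and labels, not data from the study.
seqs = {"a": "ACGT", "b": "ACGA", "c": "TTGT"}
genotype = {"a": "G1", "b": "G1", "c": "G2"}
print(mean_intra_inter(seqs, genotype))
```

A genotype definition is better supported when the mean distance within a group is clearly smaller than the mean distance to other groups.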


SNN-SB: Combining Partial Alignment Using Modified SNN Algorithm with Segment-Based for Multiple Sequence Alignments

Journal of Physics Conference Series ◽

10.1088/1742-6596/1962/1/012048 ◽

2021 ◽

Vol 1962 (1) ◽

pp. 012048

Author(s):

Aziz Nasser Boraik Ali ◽

Hassan Pyar Ali Hassan ◽

Hesham Bahamish

Keyword(s):

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Partial Alignment


DNCON2_Inter: predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning

Scientific Reports ◽

10.1038/s41598-021-91827-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Farhan Quadir ◽

Raj S. Roy ◽

Randal Halfmann ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Tertiary Structure ◽

Protein Complexes ◽

Complex Structure ◽

Great Success ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Residue Contacts ◽

Evolutionary Features

Abstract: Deep learning methods that achieved great success in predicting intrachain residue-residue contacts have been applied to predict interchain contacts between proteins. However, these methods require multiple sequence alignments (MSAs) of a pair of interacting proteins (dimers) as input, which are often difficult to obtain because there are not many known protein complexes available to generate MSAs of sufficient depth for a pair of proteins. Recognizing that the MSA of a monomer that forms homomultimers contains the co-evolutionary signals of both intrachain and interchain residue pairs in contact, we applied DNCON2 (a deep learning-based protein intrachain residue-residue contact predictor) to predict both intrachain and interchain contacts for homomultimers, using the MSA and other co-evolutionary features of a single monomer, followed by discrimination of interchain and intrachain contacts according to the tertiary structure of the monomer. We name this tool DNCON2_Inter. Allowing true-positive predictions within two residue shifts, the best average precision was obtained for the Top-L/10 predictions: 22.9% for homodimers and 17.0% for higher-order homomultimers. In some instances, especially where interchain contact densities are high, DNCON2_Inter predicted interchain contacts with 100% precision. We also developed Con_Complex, a complex structure reconstruction tool that uses predicted contacts to produce the structure of the complex. Using Con_Complex, we show that the predicted contacts can be used to accurately construct the structure of some complexes. Our experiment demonstrates that monomeric multiple sequence alignments can be used with deep learning to predict interchain contacts of homomeric proteins.
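
The Top-L/10 precision with a tolerance of two residue shifts can be sketched as below. This is a hedged illustration, not the DNCON2_Inter evaluation code: the function name, the input layout, and the reading of "within two residue shifts" as each residue index lying within ±2 of a true contact are assumptions.

```python
def top_precision(predicted, true_contacts, seq_len, divisor=10, shift=2):
    """Precision of the top L/divisor scored pairs, with a +/- shift tolerance.

    predicted: iterable of (i, j, score) residue-pair predictions.
    true_contacts: set of (i, j) pairs known to be in contact.
    seq_len: monomer length L.
    """
    n_top = max(1, seq_len // divisor)
    ranked = sorted(predicted, key=lambda p: p[2], reverse=True)[:n_top]

    def is_hit(i, j):
        # Assumed tolerance rule: both indices within `shift` of some true pair.
        return any(abs(i - ti) <= shift and abs(j - tj) <= shift
                   for ti, tj in true_contacts)

    hits = sum(is_hit(i, j) for i, j, _ in ranked)
    return hits / len(ranked) if ranked else 0.0


# Hypothetical predictions for a 20-residue monomer (top L/10 = 2 pairs).
predicted = [(1, 10, 0.9), (5, 15, 0.8), (3, 3, 0.1)]
true_contacts = {(2, 12), (9, 9)}
print(top_precision(predicted, true_contacts, seq_len=20))
```

Setting `shift=0` gives the strict version of the metric, in which only exact residue pairs count as true positives.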


Exploratory analysis of multiple sequence alignments using phylogenies

Bioinformatics ◽

10.1093/bioinformatics/10.3.243 ◽

1994 ◽

Vol 10 (3) ◽

pp. 243-247

Author(s):

Brian Golding

Keyword(s):

Exploratory Analysis ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments


Gleaning structural and functional information from correlations in protein multiple sequence alignments

Current Opinion in Structural Biology ◽

10.1016/j.sbi.2016.04.006 ◽

2016 ◽

Vol 38 ◽

pp. 1-8 ◽

Cited By ~ 7

Author(s):

Andrew F Neuwald

Keyword(s):

Functional Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments


Molecular homology and multiple-sequence alignment: an analysis of concepts and practice

Australian Systematic Botany ◽

10.1071/sb15001 ◽

2015 ◽

Vol 28 (1) ◽

pp. 46 ◽

Cited By ~ 20

Author(s):

David A. Morrison ◽

Matthew J. Morgan ◽

Scot A. Kelchner

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Molecular Data ◽

Simple Relationship ◽

Sequence Alignments ◽

Multiple Sequence ◽

Molecular Change ◽

Nucleotide Homology ◽

Tree Building ◽

Molecular Homology

Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.
