SPDI: Data Model for Variants and Applications at NCBI

2019 ◽  
Author(s):  
J. Bradley Holmes ◽  
Eric Moyer ◽  
Lon Phan ◽  
Donna Maglott ◽  
Brandi L. Kattman

Abstract Motivation Normalizing diverse representations of sequence variants is critical to the elucidation of the genetic basis of disease and biological function. NCBI has long wrestled with integrating data from multiple submitters to build databases such as dbSNP and ClinVar. Inconsistent representation of variants among variant callers, local databases, and tools results in discrepancies and duplications that complicate analysis. Current tools are not robust enough to manage variants in different formats and different reference sequence coordinates. Results The SPDI (pronounced “speedy”) data model defines variants as a sequence of four operations: start at the boundary before the first position in the sequence S, advance P positions, delete D positions, then insert the sequence in the string I, giving the data model its name, SPDI. The SPDI model can thus be applied to both nucleotide and protein variants, but the services discussed here are limited to nucleotide variants. Current services convert representations between HGVS, VCF, and SPDI and provide two forms of normalization. The first, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the “Contextual Allele” for any input. The SPDI name, with its four operations, defines exactly the reference subsequence potentially affected by the variant, even in low-complexity regions such as homopolymer and dinucleotide sequence repeats. The second level of normalization depends on an alignment dataset (ADS). SPDI services perform remapping (also known as lift-over) of variants from the input reference sequence to return a list of all equivalent Contextual Alleles based on the transcript or genomic sequences that were aligned. One of these Contextual Alleles, usually the one based on the latest genomic assembly such as GRCh38, is selected to represent all of them and is designated the unique “Canonical Allele”. The ADS includes alignments between non-assembly RefSeq sequences (prefixed NM, NR, NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), and this allows for robust remapping and normalization of variants across sequences and assembly versions. Availability and implementation The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0/ [email protected]
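
The four operations described in the abstract above can be read as a small procedure on a reference string. The following Python sketch is illustrative only: the `Spdi` class, its field names, and the `apply_spdi` function are hypothetical names chosen for this example and are not part of the NCBI services. It assumes the position counts from the boundary before the first base (that is, it is 0-based).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    """Hypothetical container for the four SPDI values."""
    sequence: str   # S: identifier of the reference sequence
    position: int   # P: positions to advance from the boundary before the first base
    deletion: int   # D: number of reference positions to delete
    insertion: str  # I: sequence to insert at that point


def apply_spdi(reference: str, variant: Spdi) -> str:
    """Apply the variant to a reference string and return the resulting sequence."""
    start = variant.position
    end = start + variant.deletion
    if start < 0 or variant.deletion < 0 or end > len(reference):
        raise ValueError("variant does not fit on the reference sequence")
    # Keep everything before the start, insert I, then skip the D deleted positions.
    return reference[:start] + variant.insertion + reference[end:]
```

Under these assumptions, `apply_spdi("ACGTACGT", Spdi("ref", 3, 1, "G"))` replaces the T at offset 3 and returns `"ACGGACGT"`. A pure deletion uses an empty insertion string, and a pure insertion uses a deletion length of 0.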

2019 ◽  
Vol 36 (6) ◽  
pp. 1902-1907 ◽  
Author(s):  
J Bradley Holmes ◽  
Eric Moyer ◽  
Lon Phan ◽  
Donna Maglott ◽  
Brandi Kattman

Abstract Motivation Normalizing sequence variants on a reference, projecting them across congruent sequences and aggregating their diverse representations are critical to the elucidation of the genetic basis of disease and biological function. Inconsistent representation of variants among variant callers, local databases and tools results in discrepancies that complicate analysis. NCBI’s genetic variation resources, dbSNP and ClinVar, require a robust, scalable set of principles to manage asserted sequence variants. Results The SPDI data model defines variants as a sequence of four attributes: sequence, position, deletion and insertion, and can be applied to nucleotide and protein variants. NCBI web services convert representations among HGVS, VCF and SPDI and provide two functions to aggregate variants. One, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the ‘Contextual Allele’. The SPDI data model, with its four operations, defines exactly the reference subsequence affected by the variant, even in repeat regions, such as homopolymer and other sequence repeats. The second function projects variants across congruent sequences and depends on an alignment dataset of non-assembly NCBI RefSeq sequences (prefixed NM, NR and NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), supporting robust projection of variants across congruent sequences and assembly versions. The variant is projected to all congruent Contextual Alleles. One of these Contextual Alleles, typically the allele based on the latest assembly version, represents the entire set, is designated the unique ‘Canonical Allele’ and is used directly to aggregate variants across congruent sequences. Availability and implementation The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0. Supplementary information Supplementary data are available at Bioinformatics online.
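
One way to picture overprecision in repeat regions: deleting one A from a run of As can be written at several positions, all of which yield the same sequence. The Python sketch below is a simplified illustration of that idea and is not NCBI's Variant Overprecision Correction Algorithm itself. The function name `contextual_allele_sketch` and its arguments are assumptions made for this example. It trims bases shared by the deleted and inserted sequences, then widens a pure insertion or deletion to cover every reference position where it could equivalently be placed.

```python
def contextual_allele_sketch(reference, position, deleted, inserted):
    """Return (position, deleted, inserted) widened over the ambiguous region.

    Hypothetical helper: a simplified illustration, not NCBI's implementation.
    `position` is 0-based; `deleted` must match the reference at that position.
    """
    if position < 0 or reference[position:position + len(deleted)] != deleted:
        raise ValueError("deleted sequence does not match the reference")
    if deleted == inserted:
        return position, deleted, inserted  # identity variant, nothing to widen

    # Step 1: trim bases shared at the start, then at the end.
    while deleted and inserted and deleted[0] == inserted[0]:
        deleted, inserted = deleted[1:], inserted[1:]
        position += 1
    while deleted and inserted and deleted[-1] == inserted[-1]:
        deleted, inserted = deleted[:-1], inserted[:-1]

    # A substitution or deletion-insertion has only one placement.
    if deleted and inserted:
        return position, deleted, inserted

    # Step 2: a pure insertion or deletion may be shifted through a repeat.
    unit = deleted or inserted
    start, end = position, position + len(deleted)

    # Shift left while the preceding base matches the end of the rotated unit.
    left, rolled = start, unit
    while left > 0 and reference[left - 1] == rolled[-1]:
        rolled = rolled[-1] + rolled[:-1]
        left -= 1

    # Shift right while the following base matches the start of the rotated unit.
    right, rolled = end, unit
    while right < len(reference) and reference[right] == rolled[0]:
        rolled = rolled[1:] + rolled[0]
        right += 1

    # Step 3: report the whole ambiguous window and its variant sequence.
    widened_deleted = reference[left:right]
    widened_inserted = reference[left:start] + inserted + reference[end:right]
    return left, widened_deleted, widened_inserted
```

For example, with the reference `GCAAAT`, deleting the single A at offset 2, 3 or 4 gives the same result in each case: position 2, deleted `AAA`, inserted `AA`. All three overprecise descriptions collapse to one representation.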


2020 ◽  
pp. PHYTO-09-20-041
Author(s):  
Christina Straub ◽  
Elena Colombi ◽  
Honour C. McCann

Population genomics is transforming our understanding of pathogen biology and evolution, and contributing to the prevention and management of disease in diverse crops. We provide an overview of key methods in bacterial population genomics and describe recent work focusing on three topics of critical importance to plant pathology: (i) resolving pathogen origins and transmission pathways during outbreak events, (ii) identifying the genetic basis of host specificity and virulence, and (iii) understanding how pathogens evolve in response to changing agricultural practices. Copyright © 2020 The Author(s). This is an open access article distributed under the CC BY-NC-ND 4.0 International license.


2016 ◽  
Author(s):  
Afif Elghraoui ◽  
Samuel J Modlin ◽  
Faramarz Valafar

Abstract The genetic basis of virulence in Mycobacterium tuberculosis has been investigated through genome comparisons of its virulent (H37Rv) and attenuated (H37Ra) sister strains. Such analysis, however, relies heavily on the accuracy of the sequences. While the H37Rv reference genome has had several corrections to date, that of H37Ra is unmodified since its original publication. Here, we report the assembly and finishing of the H37Ra genome from single-molecule, real-time (SMRT) sequencing. Our assembly reveals that the number of H37Ra-specific variants is less than half of what the Sanger-based H37Ra reference sequence indicates, undermining and, in some cases, invalidating the conclusions of several studies. PE_PPE family genes, which are intractable to commonly used sequencing platforms because of their repetitive and GC-rich nature, are overrepresented in the set of genes in which all reported H37Ra-specific variants are contradicted. We discuss how our results change the picture of virulence attenuation and the power of SMRT sequencing for producing high-quality reference genomes.


2022 ◽  
Vol 12 (1) ◽  
pp. 1-12
Author(s):  
Nadeem Ahmad ◽  
Zubair Sharif ◽  
Sarah Bukhari ◽  
Omer Aziz

Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in people. SNPs are a valuable resource for exploring the genetic basis of disease. The XPA gene encodes a protein used to repair damaged DNA. This study used computational methods to classify SNPs and estimate their probability of being neutral or deleterious. The purpose of this analysis is to predict the effect of nsSNPs on the structure and function of the XPA protein. Data were collected from the NCBI-hosted dbSNP. The authors examined the pathogenic effect of 194 nsSNPs in the XPA gene with computational tools. Four nsSNPs (C126S, C126W, R158S, and R227Q) that potentially affect the structure and function of the XPA protein were identified with a combination of SIFT, PolyPhen, Provean, PHD-SNP, I-Mutant, the ConSurf server and Project HOPE. This is the first comprehensive analysis in which XPA gene variants were studied using in silico methods, and it provides further insight into XPA protein variants and function.


Genome ◽  
2010 ◽  
Vol 53 (10) ◽  
pp. 753-762 ◽  
Author(s):  
Wilfried Haerty ◽  
G. Brian Golding

For decades proteins were thought to interact in a “lock and key” system, which led to the definition of a paradigm linking stable three-dimensional structure to biological function. As a consequence, any non-structured peptide was considered to be nonfunctional and to evolve neutrally. Surprisingly, the most commonly shared peptides between eukaryotic proteomes are low-complexity sequences that in most conditions do not present a stable three-dimensional structure. However, because these sequences evolve rapidly and because the size variation of a few of them can have deleterious effects, low-complexity sequences have been suggested to be the target of selection. Here we review evidence that supports the idea that these simple sequences should not be considered just “junk” peptides and that selection drives the evolution of many of them.


2017 ◽  
Author(s):  
Vijay K. Ulaganathan ◽  
Axel Ullrich

Abstract A significant development in the field of human biology is the revelation of millions of unannotated protein sequence variants emerging from several human genotyping and genome sequencing initiatives. This presents unique opportunities as well as confounding challenges in our understanding of how molecular signalling outcomes vary among individuals in the general population. As a result, the conventional ‘one drug fits all’ line of approach in the drug discovery process is becoming obsolete. However, an innovative genotype-specific approach targeting protein sequence variants instead of a reference protein target is currently lacking. In this short communication we report a remarkable observation of antibody-mediated knockdown of intracellular protein expression. This suggests that allele-specific inhibition of protein-variant expression can be achieved by intracellular delivery of lipid-conjugated linear epitope-specific monoclonal antibodies. The results presented here demonstrate novel opportunities for interrogating the protein-coding variations in human genomes and new therapeutic strategies for the inhibition of pathogenic protein variants in a genotype-centric manner.


2018 ◽  
Author(s):  
Jennifer Lu ◽  
Steven L. Salzberg

Abstract Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.


Author(s):  
Craig M. Powell ◽  
Lisa M. Monteggia

The Autisms: Molecules to Model Systems is designed to introduce the genetic basis for multiple autisms and discuss the gene mutations within the context of their biological function. The text is directed to advanced undergraduate students, graduate students, postdoctoral fellows, psychology students and professionals, psychiatrists, neurologists, and neuroscience researchers alike. It is hoped that readers will be engaged in this emerging field and will be motivated to read further and to cultivate their own understanding and constructs for future research into this enigmatic group of disorders known as the autisms.

