SPDI: data model for variants and applications at NCBI

Abstract. Motivation: Normalizing sequence variants on a reference, projecting them across congruent sequences and aggregating their diverse representations are critical to the elucidation of the genetic basis of disease and biological function. Inconsistent representation of variants among variant callers, local databases and tools results in discrepancies that complicate analysis. NCBI’s genetic variation resources, dbSNP and ClinVar, require a robust, scalable set of principles to manage asserted sequence variants. Results: The SPDI data model defines variants as a sequence of four attributes: sequence, position, deletion and insertion, and can be applied to nucleotide and protein variants. NCBI web services convert representations among HGVS, VCF and SPDI and provide two functions to aggregate variants. One, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the ‘Contextual Allele’. The SPDI data model, with its four operations, defines exactly the reference subsequence affected by the variant, even in repeat regions, such as homopolymer and other sequence repeats. The second function projects variants across congruent sequences and depends on an alignment dataset of non-assembly NCBI RefSeq sequences (prefixed NM, NR and NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), supporting robust projection of variants across congruent sequences and assembly versions. The variant is projected to all congruent Contextual Alleles. One of these Contextual Alleles, typically the allele based on the latest assembly version, represents the entire set, is designated the unique ‘Canonical Allele’ and is used directly to aggregate variants across congruent sequences. Availability and implementation: The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0. Supplementary information: Supplementary data are available at Bioinformatics online.
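The four attributes named in the abstract can be pictured as a small record. The sketch below is illustrative only: the class name `Spdi`, the helper names, and the colon-delimited text form `sequence:position:deletion:insertion` are assumptions not stated in the abstract, and the coordinate in the example is made up. It also assumes the deletion is written out as bases rather than as a length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    """Hypothetical container for the four SPDI attributes."""
    sequence: str   # identifier of the reference sequence, e.g. an accession
    position: int   # number of positions to advance from the start of the sequence
    deletion: str   # reference bases removed at that position (may be empty)
    insertion: str  # bases inserted in their place (may be empty)


def parse_spdi(text: str) -> Spdi:
    """Split an assumed 'sequence:position:deletion:insertion' string into its parts."""
    sequence, position, deletion, insertion = text.split(":")
    return Spdi(sequence, int(position), deletion, insertion)


def format_spdi(variant: Spdi) -> str:
    """Write the four attributes back out in the same assumed colon-delimited form."""
    return f"{variant.sequence}:{variant.position}:{variant.deletion}:{variant.insertion}"


# Illustrative substitution; the coordinate is invented for this example.
example = parse_spdi("NC_000001.11:99:A:G")
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch: a variant is a value, so two records with the same four attributes compare equal and can be used as dictionary keys when aggregating.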


SPDI: Data Model for Variants and Applications at NCBI

10.1101/537449 ◽

2019 ◽

Cited By ~ 1

Author(s):

J. Bradley Holmes ◽

Eric Moyer ◽

Lon Phan ◽

Donna Maglott ◽

Brandi L. Kattman

Keyword(s):

Open Access ◽

Data Model ◽

Genetic Basis ◽

Biological Function ◽

Low Complexity ◽

Reference Sequence ◽

Genomic Sequences ◽

Sequence Variants ◽

Correction Algorithm ◽

Protein Variants

Abstract. Motivation: Normalizing diverse representations of sequence variants is critical to the elucidation of the genetic basis of disease and biological function. NCBI has long wrestled with integrating data from multiple submitters to build databases such as dbSNP and ClinVar. Inconsistent representation of variants among variant callers, local databases, and tools results in discrepancies and duplications that complicate analysis. Current tools are not robust enough to manage variants in different formats and different reference sequence coordinates. Results: The SPDI (pronounced “speedy”) data model defines variants as a sequence of 4 operations: start at the boundary before the first position in the sequence S, advance P positions, delete D positions, then insert the sequence in the string I, giving the data model its name, SPDI. The SPDI model can thus be applied to both nucleotide and protein variants, but the services discussed here are limited to nucleotide. Current services convert representations between HGVS, VCF, and SPDI and provide two forms of normalization. The first, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the “Contextual Allele” for any input. The SPDI name, with its four operations, defines exactly the reference subsequence potentially affected by the variant, even in low complexity regions such as homopolymer and dinucleotide sequence repeats. The second level of normalization depends on an alignment dataset (ADS). SPDI services perform remapping (also known as lift-over) of variants from the input reference sequence to return a list of all equivalent Contextual Alleles based on the transcript or genomic sequences that were aligned. One of these Contextual Alleles, usually the one based on the latest genomic assembly such as GRCh38, is selected to represent them all and is designated the unique “Canonical Allele”. ADS includes alignments between non-assembly RefSeq sequences (prefixed NM, NR, NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), and this allows for robust remapping and normalization of variants across sequences and assembly versions. Availability and implementation: The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0/
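The four operations described in this abstract (start before the first position, advance P, delete D, insert I) can be sketched as a single string operation. This is a minimal illustration under stated assumptions, not the NCBI implementation: the function name `apply_spdi` and the toy reference `GATTTTC` are hypothetical, and the reference is held in memory as a plain string rather than looked up by identifier.

```python
def apply_spdi(reference: str, position: int, deletion_length: int, insertion: str) -> str:
    """Apply the four SPDI operations to an in-memory reference string.

    Start at the boundary before the first base, advance `position` bases,
    delete `deletion_length` bases, then insert `insertion` at that point.
    """
    if position < 0 or deletion_length < 0:
        raise ValueError("position and deletion length must be non-negative")
    if position + deletion_length > len(reference):
        raise ValueError("variant does not fit on the reference")
    return reference[:position] + insertion + reference[position + deletion_length:]


# Toy reference with a homopolymer run of four T bases (hypothetical example).
reference = "GATTTTC"

# Deleting one T at any of the four positions in the run gives the same sequence,
# which is why a single-position description is more precise than the data support.
shifted_results = {apply_spdi(reference, p, 1, "") for p in range(2, 6)}

# One description that covers the whole run: delete all four T bases, insert three.
widened_result = apply_spdi(reference, 2, 4, "TTT")
```

In this toy case all four shifted deletions and the widened description produce `GATTTC`. The widened form is only meant to illustrate the idea, mentioned in the abstract, of naming exactly the reference subsequence potentially affected in a low-complexity region; the Variant Overprecision Correction Algorithm itself is not reproduced here.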


SARS2020: an integrated platform for identification of novel coronavirus by a consensus sequence-function model

Bioinformatics ◽

10.1093/bioinformatics/btaa767 ◽

2020 ◽

Author(s):

Dachuan Zhang ◽

Tong Zhang ◽

Sheng Liu ◽

Dandan Sun ◽

Shaozhen Ding ◽

...

Keyword(s):

Biological Function ◽

Consensus Sequence ◽

Rapid Identification ◽

Data Driven ◽

Supplementary Information ◽

Respiratory Syndrome Virus ◽

The Novel ◽

Function Model ◽

Catalytic Function ◽

Novel Coronavirus

Abstract. Motivation: The 2019 novel coronavirus outbreak has significantly affected global health and society. Thus, predicting biological function from pathogen sequence is crucial and urgently needed. However, little work has been conducted to identify viruses by the enzymes that they encode, which are key to pathogen propagation. Results: We built a comprehensive scientific resource, SARS2020, which integrates coronavirus-related research, genomic sequences and results of anti-viral drug trials. In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the severe acute respiratory syndrome virus. This data-driven sequence-based strategy will enable rapid identification of agents responsible for future epidemics. Availability and implementation: SARS2020 is available at http://design.rxnfinder.org/sars2020/. Supplementary information: Supplementary data are available at Bioinformatics online.


Likelihood-based Fits of Folding Transitions (LiFFT) for Biomolecule Mapping Data

10.1101/294041 ◽

2018 ◽

Author(s):

Rhiju Das

Keyword(s):

Small Molecules ◽

Biological Function ◽

Visual Assessment ◽

Supplementary Information ◽

Chemical Mapping ◽

Mapping Data ◽

Modeling Uncertainties ◽

Multi Wavelength ◽

Mapping Techniques ◽

Matlab Package

Abstract. Summary: Biomolecules shift their structures as a function of temperature and concentrations of protons, ions, small molecules, proteins, and nucleic acids. These transitions impact or underlie biological function and are being monitored at increasingly high throughput. For example, folding transitions for large collections of RNAs can now be monitored at single residue resolution by chemical mapping techniques. LIkelihood-based Fits of Folding Transitions (LIFFT) quantifies these data through well-defined thermodynamic models. LIFFT implements a Bayesian framework that takes into account data at all measured residues and enables visual assessment of modeling uncertainties that can be overlooked in least-squares fits. The framework is appropriate for multimodal techniques ranging from chemical mapping to multi-wavelength spectroscopy. Availability: Freely available MATLAB package at https://ribokit.stanford.edu/LIFFT/. Supplementary information: Supplementary data are available at Bioinformatics online.


HaploGrouper: a generalized approach to haplogroup classification

Bioinformatics ◽

10.1093/bioinformatics/btaa729 ◽

2020 ◽

Author(s):

Anuradha Jagadeesan ◽

S Sunna Ebenesersdóttir ◽

Valdis B Guðmundsdóttir ◽

Elisabet Linda Thordardottir ◽

Kristjan H S Moore ◽

...

Keyword(s):

Mitochondrial Dna ◽

Phylogenetic Tree ◽

Y Chromosome ◽

State Of The Art ◽

Supplementary Information ◽

Sequence Variants ◽

Use Case ◽

Supplementary Data ◽

Human Mitochondrial Dna ◽

Comparable Accuracy

Abstract. Motivation: We introduce HaploGrouper, versatile software to classify haplotypes into haplogroups on the basis of a known phylogenetic tree. A typical use case for this software is the assignment of haplogroups to human mitochondrial DNA (mtDNA) or Y-chromosome haplotypes. Existing state-of-the-art haplogroup-calling software is typically hard-wired to work only with either mtDNA or Y-chromosome haplotypes from humans. Results: HaploGrouper exhibits comparable accuracy in these instances and has the advantage of being able to assign haplogroups to any kind of haplotypes from any species, given an extant annotated phylogenetic tree defined by sequence variants. Availability and implementation: The software is available at the following URL: https://gitlab.com/bio_anth_decode/haploGrouper. Supplementary information: Supplementary data are available at Bioinformatics online.


GlyGen data model and processing workflow

Bioinformatics ◽

10.1093/bioinformatics/btaa238 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3941-3943 ◽

Cited By ~ 6

Author(s):

Robel Kahsay ◽

Jeet Vora ◽

Rahi Navelkar ◽

Reza Mousavi ◽

Brian C Fochtman ◽

...

Keyword(s):

Data Model ◽

Data Access ◽

Supplementary Information ◽

Use Case ◽

Creative Commons ◽

Related Data ◽

Sparql Endpoint ◽

International Data ◽

Data Portal ◽

Key Resources

Abstract. Summary: Glycoinformatics plays a major role in glycobiology research, and the development of a comprehensive glycoinformatics knowledgebase is critical. This application note describes the GlyGen data model, processing workflow and the data access interfaces, featuring programmatic use case example queries based on specific biological questions. The GlyGen project is a data integration, harmonization and dissemination project for carbohydrate and glycoconjugate-related data retrieved from multiple international data sources including UniProtKB, GlyTouCan, UniCarbKB and other key resources. Availability and implementation: The GlyGen web portal is freely available at https://glygen.org. The data portal, web services, SPARQL endpoint and GitHub repository are also freely available at https://data.glygen.org, https://api.glygen.org, https://sparql.glygen.org and https://github.com/glygener, respectively. All code is released under the GNU General Public License version 3 (GNU GPLv3) and is available on GitHub at https://github.com/glygener. The datasets are made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Supplementary information: Supplementary data are available at Bioinformatics online.


Insights Into Functional and Structural Impacts of nsSNPs in XPA-DNA Repairing Gene

International Journal of Applied Research in Bioinformatics ◽

10.4018/ijarb.2022010103 ◽

2022 ◽

Vol 12 (1) ◽

pp. 1-12

Author(s):

Nadeem Ahmad ◽

Zubair Sharif ◽

Sarah Bukhari ◽

Omer Aziz

Keyword(s):

Genetic Basis ◽

Structure And Function ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Computational Tools ◽

Project Hope ◽

Structural Impacts ◽

And Function ◽

Protein Variants ◽

Insight Into

Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in people. SNPs are a valuable resource for exploring the genetic basis of disease. The XPA gene encodes a protein used to repair damaged DNA. This study used computational methods to classify SNPs and estimate their probability of being neutral or deleterious. The purpose of this analysis is to predict the effect of nsSNPs on the structure and function of the XPA protein. Data were collected from the NCBI-hosted dbSNP. The authors examined the pathogenic effect of 194 nsSNPs in the XPA gene with computational tools. Four nsSNPs (C126S, C126W, R158S, and R227Q) that potentially affect the structure and function of the XPA protein were identified with a combination of SIFT, PolyPhen, Provean, PHD-SNP, I-Mutant, ConSurf server and Project HOPE. This is the first comprehensive analysis in which XPA gene variants were studied using in silico methods, and this research was able to gain further insight into XPA protein variants and function.


Antibody-mediated depletion of protein variant expression in living cells (Protein interference)

10.1101/119305 ◽

2017 ◽

Author(s):

Vijay K. Ulaganathan ◽

Axel Ullrich

Keyword(s):

Protein Sequence ◽

Sequence Variants ◽

Intracellular Protein ◽

Protein Variant ◽

Protein Coding ◽

Significant Development ◽

Human Genomes ◽

Reference Protein ◽

Allele Specific ◽

Protein Variants

Abstract. A significant development in the field of human biology is the revelation of millions of unannotated protein sequence variants emerging from several human genotyping and genome sequencing initiatives. This presents unique opportunities as well as confounding challenges in our understanding of how molecular signalling outcomes vary among individuals in the general population. As a result, the conventional ‘one drug fits all’ line of approach in the drug discovery process is becoming obsolete. However, an innovative genotype-specific approach targeting protein sequence variants instead of a reference protein target is currently lacking. In this short communication we report a remarkable observation of antibody-mediated knockdown of intracellular protein expression. This suggests that allele-specific inhibition of protein-variant expression can be achieved by intracellular delivery of lipid-conjugated, linear epitope-specific monoclonal antibodies. The results presented here demonstrate novel opportunities for interrogating the protein-coding variations in human genomes and new therapeutic strategies for the inhibition of pathogenic protein variants in a genotype-centric manner.


TraPS-VarI: a python module for the identification of STAT3 modulating germline receptor variants

10.1101/173047 ◽

2017 ◽

Cited By ~ 1

Author(s):

Daniel Kogan ◽

Vijay Kumar Ulaganathan

Keyword(s):

Cell Line ◽

Dna Sequences ◽

Binding Sites ◽

Cancer Prognosis ◽

Supplementary Information ◽

Type I ◽

Coding Region ◽

Link Type ◽

Juxtamembrane Region ◽

Protein Variants

Abstract. Motivation: Human individuals differ because of variations in the DNA sequences of all 46 chromosomes. Information on genetic variations altering the membrane-proximal binding sites for signal transducer and activator of transcription 3 (STAT3) is valuable for understanding the genetic basis of cancer prognosis and disease progression (Ulaganathan et al, 2015). In this regard, non-synonymous coding region mutations that alter the protein sequence in the juxtamembrane region of type I membrane proteins are biologically and clinically relevant. Knowledge of such rare cell line- and individual-specific germline receptor variants is crucial for the investigation of cell line-specific biological mechanisms and genotype-centric therapeutic approaches. Results: Here we present TraPS-VarI (Transmembrane Protein Sequence Variant Identifier), a python module to rapidly identify human germline receptor variants modulating STAT3 binding sites by using genetic variation datasets in the variant call format 4.0. For the protein variants found, the module also checks for the availability of associated therapeutic agents and ongoing clinical trial studies. Availability: The source code and binaries are freely available for download at https://gitlab.com/VJ-Ulaganathan/TraPS-VarI and the documentation can be found at http://traps-vari.readthedocs.io/. Supplementary information: Supplementary data enclosed with the manuscript file.


Introduction

10.1093/med/9780199744312.003.0001 ◽

2013 ◽

Author(s):

Craig M. Powell ◽

Lisa M. Monteggia

Keyword(s):

Graduate Students ◽

Undergraduate Students ◽

Genetic Basis ◽

Biological Function ◽

Gene Mutations ◽

Psychology Students ◽

Model Systems ◽

Future Research

The Autisms: Molecules to Model Systems is designed to introduce the genetic basis for multiple autisms and discuss the gene mutations within the context of their biological function. The text is directed to advanced undergraduate students, graduate students, postdoctoral fellows, psychology students and professionals, psychiatrists, neurologists, and neuroscience researchers alike. It is hoped that readers will be engaged in this emerging field and will be motivated to read further and to cultivate their own understanding and constructs for future research into this enigmatic group of disorders known as the autisms.


PatientExploreR: an extensible application for dynamic visualization of patient clinical history from electronic health records in the OMOP common data model

Bioinformatics ◽

10.1093/bioinformatics/btz409 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4515-4518 ◽

Cited By ~ 5

Author(s):

Benjamin S Glicksberg ◽

Boris Oskotsky ◽

Phyllis M Thangaraj ◽

Nicholas Giangreco ◽

Marcus A Badgeley ◽

...

Keyword(s):

Electronic Health Records ◽

Data Model ◽

Domain Knowledge ◽

Clinical History ◽

Supplementary Information ◽

Common Data Model ◽

Dynamic Visualization ◽

Health Records ◽

Patient Level ◽

Electronic Health

Abstract. Motivation: Electronic health records (EHRs) are quickly becoming omnipresent in healthcare, but interoperability issues and technical demands limit their use for biomedical and clinical research. Interactive and flexible software that interfaces directly with EHR data structured around a common data model (CDM) could accelerate more EHR-based research by making the data more accessible to researchers who lack computational expertise and/or domain knowledge. Results: We present PatientExploreR, an extensible application built on the R/Shiny framework that interfaces with a relational database of EHR data in the Observational Medical Outcomes Partnership CDM format. PatientExploreR produces patient-level interactive and dynamic reports and facilitates visualization of clinical data without any programming required. It allows researchers to easily construct and export patient cohorts from the EHR for analysis with other software. This application could enable easier exploration of patient-level data for physicians and researchers. PatientExploreR can incorporate EHR data from any institution that employs the CDM for users with approved access. The software code is free and open source under the MIT license, enabling institutions to install and users to expand and modify the application for their own purposes. Availability and implementation: PatientExploreR can be freely obtained from GitHub: https://github.com/BenGlicksberg/PatientExploreR. We provide instructions for how researchers with approved access to their institutional EHR can use this package. We also release an open sandbox server of synthesized patient data for users without EHR access to explore: http://patientexplorer.ucsf.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
