Utilizing Machine Learning to Accelerate Automated Assignment of Backbone NMR Data

2016, Vol 13 (1)
Author(s): Joel Venzke, David Mascharka, Paxten Johnson, Rachel Davis, Katie Roth, ...

Nuclear magnetic resonance (NMR) spectroscopy is a powerful method for determining three-dimensional structures of biomolecules, including proteins. The protein structure determination process requires measured NMR values to be assigned to specific amino acids in the primary protein sequence. Unfortunately, current manual techniques for the assignment of NMR data are time-consuming and susceptible to error. Many algorithms have been developed to automate the process, with various strengths and weaknesses. The algorithm described in this paper addresses the challenges of previous programs by utilizing machine learning to predict amino acid type, thereby increasing assignment speed. The program also generates placeholders to accommodate missing data and amino acids with unique chemical characteristics, namely proline. Through machine learning and residue-type tagging, the assignment process is greatly sped up while maintaining high accuracy.

KEYWORDS: Chemical Shift; Machine Learning; NMR; Artificial Intelligence; Proteins; Bioinformatics
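As a rough illustration of residue-type prediction from chemical shifts, the sketch below ranks candidate amino acid types by distance to typical (CA, CB) carbon shifts. It is a nearest-centroid toy, not the machine learning model described in the paper; the function name `predict_residue_type` and the table `TYPICAL_SHIFTS` are hypothetical, and the shift values are approximate averages included only for illustration.

```python
import math

# Approximate average 13C shifts in ppm as (CA, CB). These are rough,
# illustrative values chosen for this sketch, not data from the paper.
TYPICAL_SHIFTS = {
    "Ala": (53.1, 19.0),
    "Ser": (58.7, 63.8),
    "Thr": (62.2, 69.7),
    "Val": (62.5, 32.7),
    "Leu": (55.6, 42.3),
}


def predict_residue_type(ca, cb):
    """Return residue types ranked from most to least similar.

    Similarity is the Euclidean distance between the observed (ca, cb)
    pair and each residue type's typical shifts.
    """
    scored = [
        (math.hypot(ca - ref_ca, cb - ref_cb), name)
        for name, (ref_ca, ref_cb) in TYPICAL_SHIFTS.items()
    ]
    return [name for _, name in sorted(scored)]


# A spin system with CA near 53 ppm and CB near 19 ppm ranks alanine first.
ranking = predict_residue_type(52.8, 18.6)
print(ranking[0])
```

Tagging each spin system with a short ranked list of likely residue types shrinks the set of sequence positions an assignment program has to consider, which is the speed-up the abstract attributes to residue-type tagging.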

2021, Vol 8 (3), pp. 103-111
Author(s): Krishna R Gupta, Uttam Patle, Uma Kabra, P. Mishra, Milind J Umekar

Three-dimensional protein structure prediction from amino acid sequence has been a thought-provoking task for decades, but it is of pivotal importance because a protein's structure provides a better understanding of its function. In recent years, the methods for prediction of protein structures have advanced considerably. Computational techniques and the growth of protein sequence and structure databases have influenced the laborious protein structure determination process. Still, there is no single method that can predict all protein structures. In this review, we describe the four stages of protein structure determination. We also explore the current techniques used to uncover protein structure and highlight the most suitable method for a given protein.


2002, Vol 68 (9), pp. 4253-4258
Author(s): Song F. Lee, Scott A. Halperin, Jennifer B. Knight, Aaron Tait

ABSTRACT Acellular pertussis vaccines typically consist of antigens isolated from Bordetella pertussis, and pertussis toxin (PT) and filamentous hemagglutinin (FHA) are two prominent components. One of the disadvantages of a multiple-component vaccine is the cost associated with the production of the individual components. In this study, we constructed an in-frame fusion protein consisting of PT fragments (179 amino acids of PT subunit S1 and 180 amino acids of PT subunit S3) and a 456-amino-acid type I domain of FHA. The fusion protein was expressed by the commensal oral bacterium Streptococcus gordonii. The fusion protein was secreted into the culture medium as an expected 155-kDa protein, which was recognized by a polyclonal anti-PT antibody, a monoclonal anti-S1 antibody, and a monoclonal anti-FHA antibody. The fusion protein was purified from the culture supernatant by affinity and gel permeation chromatography. The immunogenicity of the purified fusion protein was assessed in BALB/c mice by performing parenteral and mucosal immunization experiments. When given parenterally, the fusion protein elicited a very strong antibody titer against the FHA type I domain, a moderate titer against native FHA, and a weak titer against PT. When given mucosally, it elicited a systemic response and a mucosal response to FHA and PT. In Western blots, the immune sera recognized the S1, S3, and S2 subunits of PT. These data collectively indicate that fragments of the pertussis vaccine components can be expressed in a single fusion protein by S. gordonii and that the fusion protein is immunogenic. This multivalent fusion protein approach may be used in designing a new generation of acellular pertussis vaccines.


2021, Vol 11 (1)
Author(s): Yue Wang, Paul M. Harrison

Abstract Homopeptides (runs of one amino-acid type) are evolutionarily important since they are prone to expand/contract during DNA replication, recombination and repair. To gain insight into the genomic/proteomic traits driving their variation, we analyzed how homopeptides and homocodons (which are pure codon repeats) vary across 405 Dikarya, and probed their linkage to genome GC/AT bias and other factors. We find that amino-acid homopeptide frequencies vary diversely between clades, with the AT-rich Saccharomycotina trending distinctly. As organisms evolve, homocodon and homopeptide numbers are majorly coupled to GC/AT-bias, exhibiting a bifurcated correlation with degree of AT- or GC-bias. Mid-GC/AT genomes tend to have markedly fewer homocodons and homopeptides. Despite these trends, homopeptides tend to be GC-biased relative to other parts of coding sequences, even in AT-rich organisms, indicating they absorb AT bias less or are inherently more GC-rich. The most frequent and most variable homopeptide amino acids favour intrinsic disorder, and homopeptide levels correlate with intrinsic-disorder content and anti-correlate with structured-domain content. Specific homopeptides show unique behaviours that we suggest are linked to inherent slippage probabilities during DNA replication and recombination, such as poly-glutamine, which is an evolutionarily very variable homopeptide with a codon repertoire unbiased for GC/AT, and poly-lysine whose homocodons are overwhelmingly made from the codon AAG.
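To make the definition of a homopeptide concrete, here is a minimal sketch that scans a protein sequence for runs of a single amino-acid type. The function name `find_homopeptides` and the default minimum run length of 5 are assumptions made for illustration; the abstract does not state the length threshold the authors used.

```python
import re


def find_homopeptides(protein, min_len=5):
    """Return (residue, start, length) for each single-residue run.

    `protein` is a one-letter amino-acid string. `min_len` is an assumed
    threshold for this sketch, not a value taken from the study.
    """
    # ([A-Z]) captures one residue; \1{n,} requires at least n more copies.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [
        (m.group(1), m.start(), len(m.group(0)))
        for m in pattern.finditer(protein)
    ]


# Hypothetical toy sequence with a poly-Q run and a poly-K run.
runs = find_homopeptides("MKQQQQQQAAKKKKKLL")
print(runs)
```

Counting such runs per residue type across many proteomes would give frequency tables of the kind the abstract compares between clades.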


2007, Vol 05 (02a), pp. 313-333
Author(s): Xiang Wan, Guohui Lin

Successful backbone resonance sequential assignment is fundamental to three-dimensional protein structure determination via Nuclear Magnetic Resonance (NMR) spectroscopy. Such a sequential assignment can roughly be partitioned into three separate steps: grouping resonance peaks in multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning these strings to non-overlapping consecutive amino acid residues in the target protein. Many existing assignment programs deal with these three steps separately. This works well on protein NMR data of close-to-ideal quality, but only moderately or even poorly on most real protein datasets, where noise and data degeneracy occur frequently. We propose in this work to partition the sequential assignment not into physical steps but only into virtual steps, and to use their outputs to cross-validate each other. The novelty is that ambiguities at the grouping step are resolved by finding the highly confident strings at the chaining step, and ambiguities at the chaining step are resolved by examining the mappings of strings at the assignment step. In this way, all ambiguities in the sequential assignment are resolved globally and optimally. The resultant assignment program is called Graph-based Approach for Sequential Assignment (GASA), which has been compared to several recent similar developments including PACES, RANDOM, MARS, and RIBRA. The performance comparisons with these works demonstrated that GASA is more promising for practical use.
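The chaining step can be sketched as a small graph problem. The example below is not GASA; it is a simplified illustration that assumes each spin system carries its own CA shift and the CA shift observed for the preceding residue, and links two spin systems when those values agree within a tolerance. The class `SpinSystem`, the functions `candidate_links` and `chains_from`, the 0.2 ppm tolerance, and the toy shift values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpinSystem:
    name: str
    ca: float       # CA shift of this residue (ppm)
    ca_prev: float  # CA shift observed for the preceding residue (ppm)


def candidate_links(systems, tol=0.2):
    """Map each spin system name to the names of systems that may follow it."""
    links = {s.name: [] for s in systems}
    for a in systems:
        for b in systems:
            # b can follow a if b's "previous residue" CA matches a's own CA.
            if a is not b and abs(a.ca - b.ca_prev) <= tol:
                links[a.name].append(b.name)
    return links


def chains_from(start, links, path=None):
    """Enumerate all maximal chains (strings) that begin at `start`."""
    path = (path or []) + [start]
    nexts = [n for n in links[start] if n not in path]
    if not nexts:
        return [path]
    chains = []
    for n in nexts:
        chains.extend(chains_from(n, links, path))
    return chains


# Hypothetical toy data: s2 and s4 both look like successors of s1.
systems = [
    SpinSystem("s1", ca=56.1, ca_prev=60.0),
    SpinSystem("s2", ca=53.0, ca_prev=56.2),
    SpinSystem("s3", ca=62.4, ca_prev=53.1),
    SpinSystem("s4", ca=45.2, ca_prev=56.0),
]
links = candidate_links(systems)
print(chains_from("s1", links))
```

In this toy case two strings start at `s1`, which is the kind of chaining ambiguity the abstract describes. Deciding between them requires the later step, where each string is checked against the residues it could map to in the target sequence.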


2020
Author(s): Yue Wang, Paul Harrison

Abstract Homopeptides (consecutive runs of one amino-acid type) are suggested to play important roles in proteome evolution, since they are prone to expand/contract during DNA replication, recombination and repair. It is currently not clear how homopeptide frequencies vary as organisms evolve, and which genomic/proteomic traits drive variation. Thus, to gain insight, we analyzed how homopeptides and homocodons (which are pure codon repeats) vary across 405 Dikarya, and probed how this variation is linked to GC/AT bias amongst other factors. We observe that amino-acid homopeptide frequencies vary diversely between clades (even close relatives), with the AT-rich Saccharomycotina trending distinctly. As organisms evolve, homocodon and homopeptide numbers are majorly coupled to GC/AT-bias, with medium GC/AT genomes having markedly fewer. Despite this, homopeptides tend to be more GC-rich than other proteome areas, even in AT-rich organisms, indicating they absorb AT bias less or are inherently more GC-rich. Furthermore, the purity of homopeptides (i.e., the degree one codon type predominates in them) varies least for amino acids with GC/AT-balanced codon repertoires, with most variation for arginine since it has only one AT-rich codon (out of six). The most frequent and most variable homopeptide amino acids have greater intrinsic disorder propensity, and annotated intrinsic disorder fractions are strongly correlated with homopeptide levels (unlike structured domain fractions, which are anti-correlated). Poly-glutamine uniquely behaves as an evolutionarily very variable homopeptide with a codon repertoire unbiased for GC/AT. In summary, homopeptide/homocodon levels are coupled to or influenced by several factors, including GC/AT bias and amino-acid intrinsic disorder propensity.
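One way to read the purity measure mentioned above is as the share of codons in a homopeptide-coding tract that match its most common codon. The sketch below follows that reading; the function name `codon_purity` and this exact definition are assumptions, since the abstract does not give a formula.

```python
from collections import Counter


def codon_purity(coding_dna):
    """Return (most common codon, its fraction of all codons in the tract).

    `coding_dna` is the in-frame DNA that encodes one homopeptide.
    """
    if not coding_dna or len(coding_dna) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame coding sequence")
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]
    top_codon, count = Counter(codons).most_common(1)[0]
    return top_codon, count / len(codons)


# Hypothetical poly-glutamine tract: three CAG codons and one CAA codon.
print(codon_purity("CAGCAGCAACAG"))
```

A tract made of a single repeated codon is a homocodon and scores 1.0 under this definition.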


2020, Vol 15 (2), pp. 121-134
Author(s): Eunmi Kwon, Myeongji Cho, Hayeon Kim, Hyeon S. Son

Background: The host tropism determinants of influenza virus, which cause changes in the host range and increase the likelihood of interaction with specific hosts, are critical for understanding the infection and propagation of the virus in diverse host species. Methods: Six types of protein sequences of influenza viral strains isolated from three classes of hosts (avian, human, and swine) were obtained. Random forest, naïve Bayes classification, and k-nearest neighbor algorithms were used for host classification. The Java language was used for sequence analysis programming and identifying host-specific position markers. Results: A machine learning technique was explored to derive the physicochemical properties of amino acids used in host classification and prediction. HA protein was found to play the most important role in determining host tropism of the influenza virus, and the random forest method yielded the highest accuracy in host prediction. Conserved amino acids that exhibited host-specific differences were also selected and verified, and they were found to be useful position markers for host classification. Finally, ANOVA and post-hoc testing revealed that the physicochemical properties of amino acids, comprising protein sequences combined with position markers, differed significantly among hosts. Conclusion: The host tropism determinants and position markers described in this study can be used in related research to classify, identify, and predict the hosts of influenza viruses that are currently susceptible or likely to be infected in the future.

