Classification of Amino-Acid Sequences Using State-Space Models

Molecular sequence information about viruses has mostly confirmed the groupings devised by traditional taxonomic methods, but has shown in addition that the genes of related species may differ in number, arrangement, orientation and sequence homology. It has also revealed that true genetic recombination between viruses has been common, even among those with RNA genomes; indeed, most virus groups seem to have arisen by recombination. Thus, there is an unexpected wealth of genetic chaos hidden behind the façade of the phenotype, and it is possible that the difficulties that plant taxonomists have had in identifying the relationships of the major groupings of plants could have similar causes. Nonetheless, molecular taxonomy does give sensible results, and this is illustrated by a classification of the large subunit Rubisco proteins of 21 plant species based on their amino acid sequences.


Ambiguity coding allows accurate inference of evolutionary parameters from alignments in an aggregated state-space

10.1101/802603 ◽ 2019

Author(s): Claudia C. Weber ◽ Umberto Perron ◽ Dearbhaile Casey ◽ Ziheng Yang ◽ Nick Goldman

Keyword(s): Amino Acid ◽ State Space ◽ Amino Acid Sequences ◽ Side Chain ◽ Parameter Estimates ◽ Amino Acid Side Chain ◽ Ancestral Reconstruction ◽ Codon Model ◽ Reconstruction Performance ◽ Chain Configuration

How can we best learn the history of a protein’s evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modelling based on inferred amino acid sequence and side chain configuration). But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the “unseen” state-space. To illustrate this, we consider two examples with amino acid sequences as input. We show that ω, a parameter describing the relative strength of selection on non-synonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model. Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy. Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible.
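The ambiguity-coding idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it only shows one plausible way to turn an observed amino acid into a tip vector over the 61 sense codons, assuming the standard genetic code. The names `SENSE_CODONS` and `tip_vector` are hypothetical, as is the choice to treat `-`, `?` and `X` as fully ambiguous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AA_TABLE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AA_TABLE))
# The 61 sense codons form the larger, "unseen" state-space.
SENSE_CODONS = [c for c in CODONS if CODON_TO_AA[c] != "*"]


def tip_vector(aa):
    """Return a 61-entry vector: 1.0 for every codon compatible with `aa`.

    Assumption: gaps and unknown residues are treated as compatible
    with every sense codon (standard missing-data handling).
    """
    if aa in {"-", "?", "X"}:
        return [1.0] * len(SENSE_CODONS)
    vec = [1.0 if CODON_TO_AA[c] == aa else 0.0 for c in SENSE_CODONS]
    if not any(vec):
        raise ValueError(f"unrecognised amino acid: {aa!r}")
    return vec


if __name__ == "__main__":
    for aa in "MLW-":
        print(aa, int(sum(tip_vector(aa))), "compatible codons")
```

A vector of this kind would stand in for the usual one-hot codon observation at each alignment column, so that a codon model can be applied to amino acid data.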


Functional classification of proteins based on projection of amino acid sequences: application for prediction of protein kinase substrates

BMC Bioinformatics ◽ 10.1186/1471-2105-11-313 ◽ 2010 ◽ Vol 11 (1) ◽ Cited By ~ 13

Author(s): Boris Sobolev ◽ Dmitry Filimonov ◽ Alexey Lagunin ◽ Alexey Zakharov ◽ Olga Koborova ◽ ...

Keyword(s): Amino Acid ◽ Protein Kinase ◽ Amino Acid Sequences ◽ Functional Classification ◽ Kinase Substrates


Ambiguity Coding Allows Accurate Inference of Evolutionary Parameters from Alignments in an Aggregated State-Space

Systematic Biology ◽ 10.1093/sysbio/syaa036 ◽ 2020 ◽ Vol 70 (1) ◽ pp. 21-32

Author(s): Claudia C Weber ◽ Umberto Perron ◽ Dearbhaile Casey ◽ Ziheng Yang ◽ Nick Goldman

Keyword(s): Amino Acid ◽ State Space ◽ Amino Acid Sequences ◽ Side Chain ◽ Parameter Estimates ◽ Amino Acid Side Chain ◽ Ancestral Reconstruction ◽ Codon Model ◽ Reconstruction Performance ◽ Chain Configuration

Abstract: How can we best learn the history of a protein’s evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modeling based on inferred amino acid sequence and side chain configuration). But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the “unseen” state-space. To illustrate this, we consider two examples with amino acid sequences as input. We show that ω, a parameter describing the relative strength of selection on nonsynonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model. Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy. Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible. [Ancestral reconstruction; natural selection; protein structure; state-spaces; substitution models.]
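As a rough illustration of how ambiguous observations enter a likelihood calculation, the following Python sketch computes the likelihood of two tips joined at a root when each tip is only known to lie in a class of hidden states. Everything here is an assumption made for illustration: a toy 4-state equal-rates model replaces the 61-state codon and 55-state rotamer models of the paper, the root distribution is uniform, and the names `equal_rates_P`, `conditional` and `pair_likelihood` are hypothetical.

```python
import math


def equal_rates_P(k, t):
    """Transition matrix of a k-state equal-rates model after time t.

    Time is scaled so that one substitution is expected per unit time.
    """
    decay = math.exp(-k * t / (k - 1))
    same = 1.0 / k + (1.0 - 1.0 / k) * decay
    diff = 1.0 / k - decay / k
    return [[same if i == j else diff for j in range(k)] for i in range(k)]


def conditional(P, tip):
    """Probability of the (possibly ambiguous) tip data given each parent state."""
    return [sum(p * v for p, v in zip(row, tip)) for row in P]


def pair_likelihood(tip1, tip2, t1, t2):
    """Likelihood of two tips joined at a root with a uniform root distribution."""
    k = len(tip1)
    c1 = conditional(equal_rates_P(k, t1), tip1)
    c2 = conditional(equal_rates_P(k, t2), tip2)
    return sum((1.0 / k) * a * b for a, b in zip(c1, c2))


# Hidden states 0..3 are observed only as two aggregated classes.
CLASS_A = [1.0, 1.0, 0.0, 0.0]  # compatible with hidden states 0 and 1
CLASS_B = [0.0, 0.0, 1.0, 1.0]  # compatible with hidden states 2 and 3

if __name__ == "__main__":
    print("A/A:", pair_likelihood(CLASS_A, CLASS_A, 0.1, 0.1))
    print("A/B:", pair_likelihood(CLASS_A, CLASS_B, 0.1, 0.1))
```

Because each tip vector marks a set of compatible hidden states rather than a single one, the sum over hidden states is carried out inside the usual pruning step and no change to the substitution model itself is needed.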


Bacterial lipolytic enzymes: classification and properties

Biochemical Journal ◽ 10.1042/bj3430177 ◽ 1999 ◽ Vol 343 (1) ◽ pp. 177-183 ◽ Cited By ~ 507

Author(s): Jean Louis ARPIGNY ◽ Karl-Erich JAEGER

Keyword(s): Amino Acid ◽ Structural Features ◽ Biological Properties ◽ Amino Acid Sequences ◽ Important Class ◽ Lipolytic Enzymes ◽ Disulphide Bonds ◽ Enzyme Families

Knowledge of bacterial lipolytic enzymes is increasing at a rapid and exciting rate. To obtain an overview of this industrially very important class of enzymes and their characteristics, we have collected and classified the information available from protein and nucleotide databases. Here we propose an updated and extensive classification of bacterial esterases and lipases based mainly on a comparison of their amino acid sequences and some fundamental biological properties. These new insights result in the identification of eight different families with the largest being further divided into six subfamilies. Moreover, the classification enables us to predict (1) important structural features such as residues forming the catalytic site or the presence of disulphide bonds, (2) types of secretion mechanism and requirement for lipase-specific foldases, and (3) the potential relationship to other enzyme families. This work will therefore contribute to a faster identification and to an easier characterization of novel bacterial lipolytic enzymes.


Identification and genome characterization of Heliothis armigera cypovirus types 5 and 14 and Heliothis assulta cypovirus type 14

Journal of General Virology ◽ 10.1099/vir.0.81435-0 ◽ 2006 ◽ Vol 87 (2) ◽ pp. 387-394 ◽ Cited By ~ 17

Author(s): Yang Li ◽ Li Tan ◽ Yanqiu Li ◽ Wuguo Chen ◽ Jiamin Zhang ◽ ...

Keyword(s): Amino Acid ◽ Heliothis Armigera ◽ Amino Acid Sequences ◽ Nucleotide Sequencing ◽ Sequencing Analysis ◽ Electrophoretic Patterns ◽ The Family ◽ Genomic Characterization

Genomic characterization of Heliothis armigera cypovirus (HaCPV) isolated from China showed that insects were co-infected with several cypoviruses (CPVs). One of the CPVs (HaCPV-5) could be separated from the others by changing the rearing conditions of the Heliothis armigera larvae. This finding was further confirmed by nucleotide sequencing analysis. Genomic sequences of segments S10–S7 from HaCPV-14, S10 and S7 from HaCPV-5, and S10 from Heliothis assulta CPV-14 were compared. Results from database searches showed that the nucleotide sequences and deduced amino acid sequences of the newly identified CPVs had high levels of identity with those of reported CPVs of the same type, but not with CPVs of different types. Putative amino acid sequences of HaCPV-5 S7 were similar to that of the protein from Rice ragged stunt virus (genus Oryzavirus, family Reoviridae), suggesting that CPVs and oryzaviruses are related more closely than other genera of the family Reoviridae. Conserved motifs were also identified at the ends of each RNA segment of the same virus type: type 14, 5′-AGAAUUU…CAGCU-3′; and type 5, 5′-AGUU…UUGC-3′. Our results are consistent with classification of CPV types based on the electrophoretic patterns of CPV double-stranded RNA.
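The conserved terminal motifs reported in this abstract lend themselves to a simple check. The Python sketch below is only an illustration, not a method from the paper: it tests whether an RNA segment starts and ends with the type 14 or type 5 motifs quoted above. The names `TERMINAL_MOTIFS` and `matching_types` are hypothetical, and converting DNA-style `T` to `U` is an added convenience.

```python
# 5' and 3' terminal motifs as quoted in the abstract.
TERMINAL_MOTIFS = {
    "type 14": ("AGAAUUU", "CAGCU"),
    "type 5": ("AGUU", "UUGC"),
}


def matching_types(segment):
    """Return the CPV types whose 5' and 3' terminal motifs both match `segment`."""
    # Normalise case and accept DNA-style input by mapping T to U.
    rna = segment.upper().replace("T", "U")
    return [
        name
        for name, (five_prime, three_prime) in TERMINAL_MOTIFS.items()
        if rna.startswith(five_prime) and rna.endswith(three_prime)
    ]


if __name__ == "__main__":
    # Made-up toy sequences, not real cypovirus segments.
    print(matching_types("AGAAUUUGGGGCAGCU"))
    print(matching_types("agttccccttgc"))
```

Matching terminal motifs alone would not establish a virus type; in the abstract they are reported alongside sequence identity and electrophoretic patterns.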


An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection

Fuzzy Sets and Systems ◽ 10.1016/j.fss.2004.10.011 ◽ 2005 ◽ Vol 152 (1) ◽ pp. 5-16 ◽ Cited By ~ 45

Author(s): Sanghamitra Bandyopadhyay

Keyword(s): Feature Extraction ◽ Amino Acid ◽ Fuzzy Clustering ◽ Amino Acid Sequences ◽ Efficient Technique ◽ Prototype Selection


A classification of glycosyl hydrolases based on amino acid sequence similarities

Biochemical Journal ◽ 10.1042/bj2800309 ◽ 1991 ◽ Vol 280 (2) ◽ pp. 309-316 ◽ Cited By ~ 1997

Author(s): B Henrissat

Keyword(s): Amino Acid ◽ Amino Acid Sequence ◽ Mechanism Of Action ◽ Classification System ◽ Structural Data ◽ Amino Acid Sequences ◽ Steady Increase ◽ Glycosyl Hydrolases ◽ Sequence Similarities

The amino acid sequences of 301 glycosyl hydrolases and related enzymes have been compared. A total of 291 sequences corresponding to 39 EC entries could be classified into 35 families. Only ten sequences (less than 5% of the sample) could not be assigned to any family. With the sequences available for this analysis, 18 families were found to be monospecific (containing only one EC number) and 17 were found to be polyspecific (containing at least two EC numbers). Implications on the folding characteristics and mechanism of action of these enzymes and on the evolution of carbohydrate metabolism are discussed. With the steady increase in sequence and structural data, it is suggested that the enzyme classification system should perhaps be revised.


A Structure-Based Classification of Class A β-Lactamases, a Broadly Diverse Family of Enzymes

Clinical Microbiology Reviews ◽ 10.1128/cmr.00019-15 ◽ 2015 ◽ Vol 29 (1) ◽ pp. 29-57 ◽ Cited By ~ 42

Author(s): Alain Philippon ◽ Patrick Slama ◽ Paul Dény ◽ Roger Labia

Keyword(s): Amino Acid ◽ Catalytic Properties ◽ Amino Acid Sequences ◽ Conserved Residues ◽ X Ray ◽ Class A ◽ Rapid Changes ◽ Major Branch ◽ Limited Spectrum

Summary: For medical biologists, sequencing has become a commonplace technique to support diagnosis. Rapid changes in this field have led to the generation of large amounts of data, which are not always correctly listed in databases. This is particularly true for data concerning class A β-lactamases, a group of key antibiotic resistance enzymes produced by bacteria. Many genomes have been reported to contain putative β-lactamase genes, which can be compared with representative types. We analyzed several hundred amino acid sequences of class A β-lactamase enzymes for phylogenic relationships, the presence of specific residues, and cluster patterns. A clear distinction was first made between DD-peptidases and class A enzymes based on a small number of residues (S70, K73, P107, 130SDN132, G144, E166, 234K/R, 235T/S, and 236G [Ambler numbering]). Other residues clearly separated two main branches, which we named subclasses A1 and A2. Various clusters were identified on the major branch (subclass A1) on the basis of signature residues associated with catalytic properties (e.g., limited-spectrum β-lactamases, extended-spectrum β-lactamases, and carbapenemases). For subclass A2 enzymes (e.g., CfxA, CIA-1, CME-1, PER-1, and VEB-1), 43 conserved residues were characterized, and several significant insertions were detected. This diversity in the amino acid sequences of β-lactamases must be taken into account to ensure that new enzymes are accurately identified. However, with the exception of PER types, this diversity is poorly represented in existing X-ray crystallographic data.
