amino acid sequence information
Recently Published Documents


Total documents: 26 (five years: 9)
H-index: 9 (five years: 0)

2021 ◽  
Author(s):  
Wenfa Ng

The existence of a theoretical ribosomal protein mass fingerprint, together with the utility of ribosomal proteins as biomarkers in mass spectrometry microbial identification, suggests phylogenetic significance for this class of proteins. To serve these two functions, a facile means of identifying and extracting important attributes of ribosomal proteins from the proteome data file of a microbial species must be found. Additionally, there is a need to calculate important properties of ribosomal proteins, such as molecular weight and nucleotide sequence, from the amino acid sequence information in a FASTA proteome file. This work sought to support that endeavour by developing MATLAB software that extracts the amino acid sequence information of all ribosomal proteins from the FASTA proteome data file of a microbial species downloaded from UniProt. Built-in MATLAB functions are subsequently employed to calculate important properties of the extracted ribosomal proteins, such as number of amino acid residues, molecular weight and nucleotide sequence. All of this information is output, as a database, to an Excel file for ease of storage and retrieval. Analysis of an Escherichia coli K-12 proteome revealed that the bacterium possesses a total of 59 ribosomal proteins distributed between the large and small ribosome subunits. The ribosomal proteins range in sequence length from 38 residues (50S ribosomal protein L36) to 557 residues (30S ribosomal protein S1), and in molecular weight from 4364.305 Da (50S ribosomal protein L36) to 61157.66 Da (30S ribosomal protein S1). More importantly, the distribution of molecular weight across the different ribosomal proteins in E. coli forms a smooth curve, which suggests strong co-evolution of ribosomal protein sequence and mass given the tight constraints that a functional ribosome presents.
Finally, cluster analysis reveals a preponderance of small ribosomal proteins compared to larger ones, which remains a mystery to evolutionary biologists. Overall, the information encapsulated in the ribosomal protein database should be useful for gaining a better appreciation of the molecular weight distribution of ribosomal proteins in a species, and for supplying ribosomal protein biomarkers for identifying particular microbial species by mass spectrometry.
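The extraction step described in this abstract can be illustrated with a minimal sketch. It is written in Python rather than the author's MATLAB, and it is not the author's software: the header filter, the function names and the toy FASTA records are all assumptions made for illustration. Molecular weight is computed from standard average residue masses plus one water per chain, which may differ slightly from what MATLAB's built-in function reports.

```python
# Average residue masses in Da (standard values); one water is added per chain.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def molecular_weight(seq):
    """Average molecular weight of a peptide chain in Da."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def ribosomal_table(fasta_text):
    """Keep records whose header mentions 'ribosomal protein'; sort by mass."""
    rows = []
    for header, seq in read_fasta(fasta_text):
        if "ribosomal protein" in header.lower():
            rows.append({
                "header": header,
                "length": len(seq),
                "mw": round(molecular_weight(seq), 3),
            })
    return sorted(rows, key=lambda row: row["mw"])


# Toy records with made-up identifiers and sequences, not real proteome data.
DEMO = """>sp|X00001|TOY1 50S ribosomal protein L99 OS=Toy organism
MKV
RAS
>sp|X00002|TOY2 DNA polymerase toy OS=Toy organism
GGGG
>sp|X00003|TOY3 30S ribosomal protein S99 OS=Toy organism
GG
"""

for row in ribosomal_table(DEMO):
    print(row["header"], row["length"], row["mw"])
```

Sorting by mass mirrors the molecular weight distribution discussed in the abstract; writing the rows to a spreadsheet is left out of the sketch.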


2021 ◽  
Vol 22 (S3) ◽  
Author(s):  
Toshitaka Tanebe ◽  
Takashi Ishida

Background: Recently, machine learning-based ligand activity prediction methods have been greatly improved. However, if known active compounds of a target protein are unavailable, the machine learning-based method cannot be applied. In such cases, docking simulation is generally applied because it only requires a tertiary structure of the target protein. However, the conformation search and the evaluation of binding energy of docking simulation are computationally heavy and thus docking simulation needs huge computational resources. Thus, if we can apply a machine learning-based activity prediction method for a novel target protein, such methods would be highly useful. Recently, Tsubaki et al. proposed an end-to-end learning method to predict the activity of compounds for novel target proteins. However, the prediction accuracy of the method was still insufficient because it only used amino acid sequence information of a protein as the input.
Results: In this research, we proposed an end-to-end learning-based compound activity prediction using structure information of a binding pocket of a target protein. The proposed method learns the important features by end-to-end learning using a graph neural network both for a compound structure and a protein binding pocket structure. As a result of the evaluation experiments, the proposed method has shown higher accuracy than an existing method using amino acid sequence information.
Conclusions: The proposed method achieved equivalent accuracy to docking simulation using AutoDock Vina with much shorter computing time. This indicated that a machine learning-based approach would be promising even for novel target proteins in activity prediction.


2020 ◽  
Vol 21 (S16) ◽  
Author(s):  
Leilei Liu ◽  
Xianglei Zhu ◽  
Yi Ma ◽  
Haiyin Piao ◽  
Yaodong Yang ◽  
...  

Background: Protein–protein interactions (PPIs) are of great importance in the cellular systems of organisms, since they are the basis of cellular structure and function and many essential cellular processes depend on them. Most proteins perform their functions by interacting with other proteins, so predicting PPIs accurately is crucial for understanding cell physiology.
Results: Recently, graph convolutional networks (GCNs) have been proposed to capture graph structure information and generate representations for the nodes in a graph. In our paper, we use GCNs to learn the position information of proteins in the PPI network graph, which can reflect the properties of proteins to some extent. Combining amino acid sequence information with position information yields a stronger representation for each protein, which improves the accuracy of PPI prediction.
Conclusion: Most previous methods used only the protein amino acid sequence as input for prediction, without considering the structural information of the PPI network graph. We are the first to combine amino acid sequence information and position information to build representations for proteins. The experimental results indicate that our method is strongly competitive with several sequence-based methods.
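As a rough illustration of the idea in this abstract, the sketch below runs one graph-convolution step over a tiny PPI graph and concatenates the resulting position features with per-protein sequence features. It assumes the widely used normalized-adjacency propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 X W); the abstract does not state which GCN variant, feature sizes or weights the authors used, so the function names, the graph and all numbers here are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def combine(seq_feats, pos_feats):
    """Concatenate sequence-derived and graph-derived features per protein."""
    return [list(s) + list(p) for s, p in zip(seq_feats, pos_feats)]


# Hypothetical three-protein network: protein 1 interacts with proteins 0 and 2.
ADJ = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
SEQ_FEATS = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]  # placeholder sequence embeddings

position = gcn_layer(ADJ, IDENTITY, IDENTITY)
representation = combine(SEQ_FEATS, position)
print(representation[0])
```

In a trained model the weight matrix would be learned and several layers stacked; the identity matrices here only make the aggregation step easy to inspect.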


2020 ◽  
Author(s):  
Tatjana Skrbic ◽  
Amos Maritan ◽  
Achille Giacometti ◽  
George D. Rose ◽  
Jayanth R. Banavar

The native state structures of globular proteins are stable and well packed, indicating that self-interactions are favored over protein-solvent interactions under folding conditions. We use this as a guiding principle to derive the geometry of the building blocks of protein structures, alpha-helices and strands assembled into beta-sheets, with no adjustable parameters, no amino acid sequence information, and no chemistry. There is an almost perfect fit between the dictates of mathematics and physics and the rules of quantum chemistry. Our theory establishes an energy landscape that channels protein evolution by providing sequence-independent platforms for elaborating sequence-dependent functional diversity. Our work highlights the vital role of discreteness in life and has implications for the creation of artificial life and for the nature of life elsewhere in the cosmos.


Biomolecules ◽  
2020 ◽  
Vol 10 (6) ◽  
pp. 938
Author(s):  
Kriti Chopra ◽  
Bhawna Burdak ◽  
Kaushal Sharma ◽  
Ajit Kembhavi ◽  
Shekhar C. Mande ◽  
...  

Decrypting the interface residues of protein complexes provides insight into the functions of the proteins and, hence, the overall cellular machinery. Computational methods have been devised in the past to predict interface residues using amino acid sequence information, but these methods have mostly been applied to prokaryotic protein complexes. Since the composition and rate of evolution of the primary sequence differ between prokaryotes and eukaryotes, it is important to develop a method specifically for eukaryotic complexes. Here, we report a new hybrid pipeline, named CoRNeA, for predicting protein-protein interaction interfaces in a pairwise manner from the amino acid sequence information of the interacting proteins. It is based on a framework of co-evolution, machine learning (Random Forest), and network analysis, and is trained specifically on eukaryotic protein complexes. We use co-evolution, physicochemical properties, and contact potential as the major groups of features to train the Random Forest classifier. We also incorporate the intra-contact information of the individual proteins to eliminate false positives from the predictions, keeping in mind that the amino acid sequence of a protein holds information for its own folding and not only for the interface propensities. Our predictions on example datasets show that CoRNeA not only enhances the prediction of true interface residues but also reduces false positive rates significantly.
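To make the co-evolution feature concrete, the sketch below scores every pair of residue positions across two paired alignments with mutual information, one common co-evolution measure. This is an assumption for illustration: the abstract does not say which co-evolution score CoRNeA uses, and the alignments, function names and sequences below are hypothetical. Rows of the two alignments are assumed to be paired by species.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


def coevolution_matrix(msa_a, msa_b):
    """Score every position of protein A against every position of protein B.

    msa_a and msa_b are lists of equal-length aligned sequences; row i of
    each alignment must come from the same species.
    """
    cols_a = list(zip(*msa_a))
    cols_b = list(zip(*msa_b))
    return [[mutual_information(ca, cb) for cb in cols_b] for ca in cols_a]


# Toy paired alignments (made-up sequences, four species each).
MSA_A = ["AK", "AK", "GR", "GR"]
MSA_B = ["LD", "LE", "VD", "VE"]

scores = coevolution_matrix(MSA_A, MSA_B)
print(scores)
```

In a pipeline of the kind described, a matrix like this would be one input feature per residue pair for the classifier, alongside physicochemical and contact-potential features.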


2019 ◽  
Author(s):  
Kriti Chopra ◽  
Bhawna Burdak ◽  
Kaushal Sharma ◽  
Ajit Kembhavi ◽  
Shekhar C. Mande ◽  
...  

Computational methods have been devised in the past to predict interface residues using amino acid sequence information, but they have mostly been applied to prokaryotic protein complexes. Since the composition and rate of evolution of the primary sequence differ between prokaryotes and eukaryotes, it is important to develop a method specifically for eukaryotic complexes. Here we report a new hybrid pipeline, named CoRNeA, for the prediction of protein-protein interaction interfaces from amino acid sequence information alone. It is based on a framework of co-evolution, machine learning (Random Forest) and network analysis, and is trained specifically on eukaryotic protein complexes. We incorporate the intra-contact information of the individual proteins to eliminate false positives from the predictions, as the amino acid sequence also holds information for its own folding along with the interface propensities. Our predictions on various case studies show that CoRNeA can successfully identify minimal interacting regions of two partner proteins with higher precision and recall.


2019 ◽  
Author(s):  
Wenfa Ng

The FASTA file format is a common file type for distributing proteome information, especially proteomes obtained from UniProt. While MATLAB can read FASTA files automatically using the built-in function fastaread, important information such as protein name and organism name remains enmeshed in a character array. Hence, it is difficult to extract protein names automatically from a FASTA proteome file to build a database with fields comprising protein name and amino acid sequence. The objective of this work was to develop MATLAB software that automatically extracts protein name and amino acid sequence information from a FASTA proteome file and assigns them to a new database comprising fields such as protein name, amino acid sequence, number of amino acid residues, molecular weight of the protein and nucleotide sequence of the protein. The number of amino acid residues was obtained by applying the MATLAB built-in function length to the amino acid sequence of a protein. The final two fields were provided by the MATLAB built-in functions molweight and aa2nt, respectively. The molecular weight of proteins is useful for a variety of applications, while the nucleotide sequence is essential for gene synthesis in molecular cloning. Finally, the MATLAB software is also equipped with an error check function that detects letters in the amino acid sequence that are not among the 20 natural amino acids. Sequences with such letters would be erroneous inputs to molweight and aa2nt, and are not processed.
Using built-in functions, the number of amino acid residues, molecular weight and nucleotide sequence of each protein were calculated, yielding a new protein database with improved functionality that could support a variety of biology workflows ranging from sequence alignment to molecular cloning.
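The header parsing, error check and back-translation described in this abstract can be sketched as follows. The sketch is in Python rather than MATLAB and is not the author's code. It assumes the usual UniProt header layout (`db|accession|entry name`, then the protein name, then `OS=`), uses a made-up accession and gene name in the example, and picks one fixed codon per amino acid, which is an arbitrary choice and not necessarily how aa2nt selects codons.

```python
import re

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

# One codon per amino acid from the standard genetic code. The choice among
# synonymous codons is arbitrary in this sketch.
CODON = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
}

# Assumed UniProt header layout: db|accession|entry, protein name, then OS=...
HEADER_RE = re.compile(
    r"^(?:sp|tr)\|(?P<accession>[^|]+)\|(?P<entry>\S+)\s+"
    r"(?P<name>.+?)\s+OS=(?P<organism>.+?)"
    r"(?:\s+OX=|\s+GN=|\s+PE=|\s+SV=|$)"
)


def parse_header(header):
    """Return accession, entry, protein name and organism, or None."""
    match = HEADER_RE.match(header.lstrip(">"))
    return match.groupdict() if match else None


def invalid_letters(seq):
    """Letters in seq that are not one of the 20 standard amino acids."""
    return sorted(set(seq.upper()) - STANDARD_AA)


def back_translate(seq):
    """Map an amino acid sequence to one possible coding DNA sequence."""
    bad = invalid_letters(seq)
    if bad:
        raise ValueError(f"non-standard residues: {''.join(bad)}")
    return "".join(CODON[aa] for aa in seq.upper())


# Example header with a made-up accession and gene name.
HEADER = (">sp|X00000|TOY_ECOLI 50S ribosomal protein L36 "
          "OS=Escherichia coli (strain K12) OX=83333 GN=toy PE=1 SV=1")

fields = parse_header(HEADER)
print(fields["name"], "|", fields["organism"])
print(back_translate("MKV"))
```

Rejecting a sequence before back-translation, rather than skipping the offending letters, follows the abstract's statement that sequences with non-standard letters are not processed.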


2018 ◽  
Author(s):  
Wenfa Ng

Ribosomes are highly conserved macromolecular machines whose critical function is protein synthesis. However, the existence of a unique molecular mass for the same type of ribosomal protein in individual species of the same domain of life raises an interesting question about the interaction between natural selection and the conservation of structure and function of ribosomal proteins. Given the differentiated molecular mass and sequence of ribosomal proteins across species, the structures of ribosomes are correspondingly differentiated, even though the general structure and function of the macromolecular machine are conserved across species in the same domain of life. The collection of molecular masses of all ribosomal proteins in the large and small ribosome subunits can be understood as the ribosomal protein mass fingerprint of the species, useful for gaining fundamental knowledge of ribosomal proteins as well as serving as a tool for species identification through comparison of ribosomal protein mass spectra. This preprint introduces the Theoretical Ribosomal Protein Mass Fingerprint database, which comprises the theoretical molecular mass of all ribosomal proteins of a species, calculated from available amino acid sequence information. Using amino acid sequence information from the Ribosomal Protein Gene Database, the Theoretical Ribosomal Protein Mass Fingerprint database ( https://ngwenfa.wordpress.com/database/ ) spans species from cyanobacteria, fungi, bacteria, archaea, nematodes, diatoms, micro-algae, and various model organisms. The database should be useful as a resource for gaining a fundamental understanding of the mass distribution of ribosomal proteins of a species, or as a limited reference database for identifying species by comparing the experimental ribosomal protein mass fingerprint of an unknown species against theoretically calculated fingerprints of known species.
Future expansion of the database will aim to catalogue the theoretical ribosomal protein mass fingerprint of more microbial species using amino acid sequence information from UniProt.
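The comparison of an experimental fingerprint against theoretical ones can be sketched as a simple peak-matching score. This is a hypothetical illustration, not the database's own matching procedure: the scoring rule (count of observed peaks within a relative tolerance), the 500 ppm default, the species names and all mass values are assumptions chosen only to show the mechanics.

```python
def match_count(observed, theoretical, tol_ppm=500.0):
    """Count observed masses lying within tol_ppm of any theoretical mass."""
    hits = 0
    for mass in observed:
        if any(abs(mass - t) <= t * tol_ppm / 1e6 for t in theoretical):
            hits += 1
    return hits


def rank_species(observed, database, tol_ppm=500.0):
    """Rank species by how many observed peaks their fingerprint explains."""
    scores = {
        name: match_count(observed, masses, tol_ppm)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder fingerprints in Da; these are not real ribosomal protein masses.
DATABASE = {
    "species_A": [4000.0, 7000.0, 9000.0],
    "species_B": [4100.0, 7050.0, 9500.0],
}
OBSERVED = [4000.5, 7000.9, 9500.2]

for name, score in rank_species(OBSERVED, DATABASE):
    print(name, score)
```

A real comparison would also need to account for post-translational modifications and instrument calibration, which shift observed masses away from the theoretical values.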

