MATLAB software for extracting protein name and sequence information from FASTA formatted proteome file

10.7287/peerj.preprints.27856v2 ◽

2019 ◽

Author(s):

Wenfa Ng

Keyword(s):

Molecular Weight ◽

Amino Acid ◽

Nucleotide Sequence ◽

Amino Acid Sequence ◽

Sequence Information ◽

Amino Acid Residues ◽

Protein Database ◽

Matlab Software ◽

Amino Acid Sequence Information ◽

New Protein

The FASTA file format is commonly used to distribute proteome information, especially proteomes obtained from UniProt. Although MATLAB can read FASTA files with the built-in function fastaread, important information such as protein name and organism name remains enmeshed in a character array. This makes it difficult to extract protein names automatically from a FASTA proteome file when building a database whose fields include protein name and amino acid sequence. The objective of this work was to develop MATLAB software that automatically extracts protein name and amino acid sequence information from a FASTA proteome file and assigns them to a new database with fields for protein name, amino acid sequence, number of amino acid residues, molecular weight, and nucleotide sequence. The number of amino acid residues is obtained by applying the MATLAB built-in function length to the amino acid sequence of each protein. The final two fields are provided by the MATLAB built-in functions molweight and aa2nt, respectively. Molecular weight is useful for a variety of applications, while the nucleotide sequence is essential for gene synthesis in molecular cloning. Finally, the software includes an error check that detects letters in the amino acid sequence that are not among the 20 natural amino acids. Sequences containing such letters would be invalid inputs to molweight and aa2nt and are not processed. Collectively, because important information such as protein name is enmeshed in a character array in a FASTA proteome file, this work set out to develop MATLAB software that automatically extracts protein name and amino acid sequence information and assigns them to a new protein database. Using built-in functions, the number of amino acid residues, molecular weight, and nucleotide sequence of each protein were calculated, yielding a new protein database with improved functionality that could support biology workflows ranging from sequence alignment to molecular cloning.
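The workflow this abstract describes can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not the author's code. It assumes UniProt-style headers of the form `>db|accession|entry_name Protein name OS=Organism OX=...`, uses a standard table of average residue masses in place of molweight, and omits the back-translation step that aa2nt would provide. The two sample records, their accessions, and their sequences are made up for illustration.

```python
import re

# Average residue masses (Da) for the 20 natural amino acids; one water is added per chain.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Assumed header layout (UniProt style); headers that do not match are kept whole as the name.
HEADER_RE = re.compile(
    r"^(?P<db>\w+)\|(?P<acc>[^|]+)\|(?P<entry>\S+)\s+(?P<name>.*?)\s+OS=(?P<organism>.*?)\s+OX="
)

def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def build_database(text):
    """Return one dict per protein: name, organism, sequence, length, molecular weight."""
    rows = []
    for header, seq in read_fasta(text):
        match = HEADER_RE.match(header)
        name = match.group("name") if match else header
        organism = match.group("organism") if match else ""
        # Error check: only sequences made of the 20 natural amino acids are processed.
        valid = bool(seq) and all(aa in AVG_RESIDUE_MASS for aa in seq)
        weight = round(sum(AVG_RESIDUE_MASS[aa] for aa in seq) + WATER, 3) if valid else None
        rows.append({"name": name, "organism": organism, "sequence": seq,
                     "residues": len(seq), "mol_weight": weight, "valid": valid})
    return rows

# Made-up sample input (hypothetical accessions and sequences).
SAMPLE = """>sp|X00001|TOY1_EXAMP Toy protein one OS=Example organism OX=0 PE=4 SV=1
MKV
>sp|X00002|TOY2_EXAMP Toy protein two OS=Example organism OX=0 PE=4 SV=1
MKXB
"""

rows = build_database(SAMPLE)
for row in rows:
    print(row["name"], row["residues"], row["mol_weight"])
```

The second record contains letters outside the 20-letter alphabet, so it is flagged and its molecular weight is left empty, which mirrors the error check described in the abstract.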


MATLAB software for extracting protein name and sequence information from FASTA formatted proteome file

10.7287/peerj.preprints.27856v1 ◽

2019 ◽

Author(s):

Wenfa Ng

Keyword(s):

Molecular Weight ◽

Amino Acid ◽

Nucleotide Sequence ◽

Amino Acid Sequence ◽

Sequence Information ◽

Amino Acid Residues ◽

Protein Database ◽

Matlab Software ◽

Amino Acid Sequence Information ◽

New Protein

The FASTA file format is commonly used to distribute proteome information, especially proteomes obtained from UniProt. Although MATLAB can read FASTA files with the built-in function fastaread, important information such as protein name and organism name remains enmeshed in a character array. This makes it difficult to extract protein names automatically from a FASTA proteome file when building a database whose fields include protein name and amino acid sequence. The objective of this work was to develop MATLAB software that automatically extracts protein name and amino acid sequence information from a FASTA proteome file and assigns them to a new database with fields for protein name, amino acid sequence, number of amino acid residues, molecular weight, and nucleotide sequence. The number of amino acid residues is obtained by applying the MATLAB built-in function length to the amino acid sequence of each protein. The final two fields are provided by the MATLAB built-in functions molweight and aa2nt, respectively. Molecular weight is useful for a variety of applications, while the nucleotide sequence is essential for gene synthesis in molecular cloning. Finally, the software includes an error check that detects letters in the amino acid sequence that are not among the 20 natural amino acids. Sequences containing such letters would be invalid inputs to molweight and aa2nt and are not processed. Collectively, because important information such as protein name is enmeshed in a character array in a FASTA proteome file, this work set out to develop MATLAB software that automatically extracts protein name and amino acid sequence information and assigns them to a new protein database. Using built-in functions, the number of amino acid residues, molecular weight, and nucleotide sequence of each protein were calculated, yielding a new protein database with improved functionality that could support biology workflows ranging from sequence alignment to molecular cloning.


Ribosomal protein database profiling lends clarity to ribosomal protein evolution and mass distribution

10.1101/2021.10.25.465821 ◽

2021 ◽

Author(s):

Wenfa Ng

Keyword(s):

Mass Spectrometry ◽

Molecular Weight ◽

Amino Acid ◽

Ribosomal Protein ◽

Ribosomal Proteins ◽

Sequence Information ◽

Protein Database ◽

Microbial Identification ◽

Microbial Species ◽

Amino Acid Sequence Information

The existence of a theoretical ribosomal protein mass fingerprint, together with the utility of ribosomal proteins as biomarkers in mass spectrometry microbial identification, suggests phylogenetic significance for this class of proteins. To serve these two functions, a facile means of identifying and extracting important attributes of ribosomal proteins from the proteome data file of a microbial species is needed. Additionally, important properties of ribosomal proteins, such as molecular weight and nucleotide sequence, need to be calculated from the amino acid sequence information in a FASTA proteome file. This work sought to support this endeavour by developing MATLAB software that extracts the amino acid sequence information of all ribosomal proteins from the FASTA proteome data file of a microbial species downloaded from UniProt. Built-in MATLAB functions are then used to calculate important properties of the extracted ribosomal proteins, such as number of amino acid residues, molecular weight, and nucleotide sequence. All of this information is output as a database to an Excel file for ease of storage and retrieval. Analysis of an Escherichia coli K-12 proteome revealed that the bacterium possesses a total of 59 ribosomal proteins distributed between the large and small ribosomal subunits. The ribosomal proteins range in sequence length from 38 residues (50S ribosomal protein L36) to 557 residues (30S ribosomal protein S1), and in molecular weight from 4364.305 Da (50S ribosomal protein L36) to 61157.66 Da (30S ribosomal protein S1). More importantly, the distribution of molecular weights of the different ribosomal proteins in E. coli forms a smooth curve, which suggests strong co-evolution of ribosomal protein sequence and mass given the tight constraints that a functional ribosome presents. Finally, cluster analysis reveals a preponderance of small ribosomal proteins compared to larger ones, which remains a mystery to evolutionary biologists. Overall, the information encapsulated in the ribosomal protein database should help in gaining a better appreciation of the molecular weight distribution of ribosomal proteins in a species, and in using ribosomal protein biomarkers to identify particular microbial species by mass spectrometry.
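The selection and export steps described in this abstract can be sketched as follows. This is a hedged Python illustration, not the author's MATLAB software: it assumes that bacterial ribosomal proteins can be recognised by names beginning with "30S ribosomal protein" or "50S ribosomal protein", and it writes CSV text instead of an Excel file. The lengths and molecular weights for L36 and S1 are the values quoted in the abstract; the third record is a made-up placeholder.

```python
import csv
import io
import re

# L36 and S1 values are quoted from the abstract; the kinase entry is a made-up placeholder.
records = [
    {"name": "30S ribosomal protein S1", "residues": 557, "mol_weight": 61157.66},
    {"name": "Placeholder kinase (made up)", "residues": 300, "mol_weight": 33000.0},
    {"name": "50S ribosomal protein L36", "residues": 38, "mol_weight": 4364.305},
]

# Assumed naming rule; real annotations may need a stricter or broader pattern.
RIBOSOMAL_RE = re.compile(r"^(30S|50S) ribosomal protein ")

def select_ribosomal(rows):
    """Keep only the records whose name matches the assumed ribosomal naming rule."""
    return [row for row in rows if RIBOSOMAL_RE.match(row["name"])]

def to_csv(rows):
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "residues", "mol_weight"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Sort by molecular weight so the extremes sit at both ends of the list.
ribosomal = sorted(select_ribosomal(records), key=lambda row: row["mol_weight"])
smallest, largest = ribosomal[0], ribosomal[-1]
csv_text = to_csv(ribosomal)
print(csv_text)
```

Anchoring the pattern at the start of the name avoids matching proteins that merely mention a ribosomal protein, such as enzymes that modify one.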


Faculty Opinions recommendation of A study of archaeal enzymes involved in polar lipid synthesis linking amino acid sequence information, genomic contexts and lipid composition.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1028632.342399 ◽

2005 ◽

Author(s):

Robert Michell

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Polar Lipid ◽

Lipid Composition ◽

Lipid Synthesis ◽

Sequence Information ◽

Amino Acid Sequence Information


Amino acid sequence information in proteins and complex proteinaceous material revealed by pyrolysis-capillary gas chromatography-low and high resolution mass spectrometry

Journal of Analytical and Applied Pyrolysis ◽

10.1016/0165-2370(87)85038-6 ◽

1987 ◽

Vol 11 ◽

pp. 313-327 ◽

Cited By ~ 75

Author(s):

Jaap J. Boon ◽

J.W. De Leeuw

Keyword(s):

Mass Spectrometry ◽

Gas Chromatography ◽

Amino Acid ◽

High Resolution ◽

Amino Acid Sequence ◽

Capillary Gas Chromatography ◽

High Resolution Mass Spectrometry ◽

Sequence Information ◽

Amino Acid Sequence Information ◽

Resolution Mass


Epitope mapping by a method that requires no amino acid sequence information

Analytical Biochemistry ◽

10.1016/0003-2697(92)90596-y ◽

1992 ◽

Vol 205 (1) ◽

pp. 179-182 ◽

Cited By ~ 7

Author(s):

Jie Yuan ◽

Philip S. Low

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Epitope Mapping ◽

Sequence Information ◽

Amino Acid Sequence Information


Molecular Identification of Family 38 α-Mannosidase of Bacillus sp. Strain GL1, Responsible for Complete Depolymerization of Xanthan

Applied and Environmental Microbiology ◽

10.1128/aem.68.6.2731-2736.2002 ◽

2002 ◽

Vol 68 (6) ◽

pp. 2731-2736 ◽

Cited By ~ 15

Author(s):

Hirokazu Nankai ◽

Wataru Hashimoto ◽

Kousaku Murata

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Cell Extract ◽

Amino Acid Sequences ◽

Glycoside Hydrolase Family ◽

Sequence Information ◽

Reading Frame ◽

A Cell ◽

Terminal Amino ◽

Amino Acid Sequence Information

When cells of Bacillus sp. strain GL1 were grown in a medium containing xanthan as a carbon source, α-mannosidase exhibiting activity toward p-nitrophenyl-α-D-mannopyranoside (pNP-α-D-Man) was produced intracellularly. The 350-kDa α-mannosidase purified from a cell extract of the bacterium was a trimer comprising three identical subunits, each with a molecular mass of 110 kDa. The enzyme hydrolyzed pNP-α-D-Man (Km = 0.49 mM) and D-mannosyl-(α-1,3)-D-glucose most efficiently at pH 7.5 to 9.0, indicating that the enzyme catalyzes the last step of the xanthan depolymerization pathway of Bacillus sp. strain GL1. The gene for α-mannosidase, cloned mainly by using N-terminal amino acid sequence information, contained an open reading frame (3,144 bp) capable of coding for a polypeptide with a molecular weight of 119,239. The deduced amino acid sequence showed homology with the amino acid sequences of α-mannosidases belonging to glycoside hydrolase family 38.
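The relationship between the reported open reading frame length and the polypeptide molecular weight can be checked with a few lines of arithmetic. The Python sketch below assumes that the 3,144 bp figure includes the stop codon; the abstract does not state this.

```python
ORF_BP = 3144          # open reading frame length reported in the abstract
REPORTED_MW = 119239   # polypeptide molecular weight reported in the abstract

codons = ORF_BP // 3
residues = codons - 1  # assumption: the ORF length counts the stop codon
mean_residue_mass = REPORTED_MW / residues

print(residues, round(mean_residue_mass, 1))
```

Under that assumption, the mean mass per residue is a quick plausibility check on the pairing of sequence length and molecular weight.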


Amino acid sequence of the acidic acrosin inhibitor (BUSI I B2) from bull seminal plasma

Collection of Czechoslovak Chemical Communications ◽

10.1135/cccc19832558 ◽

1983 ◽

Vol 48 (9) ◽

pp. 2558-2568 ◽

Cited By ~ 5

Author(s):

Bedřich Meloun ◽

Věra Jonáková ◽

Dana Čechová

Keyword(s):

Molecular Weight ◽

Amino Acid ◽

Amino Acid Sequence ◽

Seminal Plasma ◽

Sequential Data ◽

Amino Acid Residues

The molecule of the inhibitor consists of 63 amino acid residues whose sequence is the following: Glu-Ile-Tyr-Phe-Glu-Pro-Asp-Phe-Gly-Phe-Pro-Pro-Asp-Cys-Lys-Val-Tyr-Thr-Glu-Ala-Cys-Thr-Arg-Glu-Tyr-Asn-Pro-Ile-Cys-Asp-Ser-Ala-Ala-Lys-Thr-Tyr-Ser-Asn-Glu-Cys-Thr-Phe-Cys-Asn-Glu-Lys-Met-Asn-Asn-Asp-Ala-Asp-Ile-His-Phe-Gln-His-Phe-Gly-Glu-Cys-Glu-Tyr. The sequential data were obtained by the analysis of peptides isolated from the tryptic and chymotryptic digest of the carboxymethylated inhibitor. The molecular weight of the inhibitor calculated from its amino acid sequence is 7377.
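The residue count, and the kind of calculation behind a sequence-derived molecular weight, can be reproduced with a short Python sketch. It converts the three-letter sequence above to one-letter codes, counts the residues, and sums standard average residue masses plus one water. The mass table is an assumption: the abstract does not say which atomic masses were used or whether disulfide bonds or other modifications were taken into account, so the computed value need not equal the reported figure exactly.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Gln": "Q", "Glu": "E",
    "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F",
    "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Average residue masses in Da (assumed table, free thiols, no modifications).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Sequence copied from the abstract.
SEQUENCE_3 = (
    "Glu-Ile-Tyr-Phe-Glu-Pro-Asp-Phe-Gly-Phe-Pro-Pro-Asp-Cys-Lys-Val-Tyr-Thr-Glu-Ala-"
    "Cys-Thr-Arg-Glu-Tyr-Asn-Pro-Ile-Cys-Asp-Ser-Ala-Ala-Lys-Thr-Tyr-Ser-Asn-Glu-Cys-"
    "Thr-Phe-Cys-Asn-Glu-Lys-Met-Asn-Asn-Asp-Ala-Asp-Ile-His-Phe-Gln-His-Phe-Gly-Glu-"
    "Cys-Glu-Tyr"
)

# Translate each three-letter code, then sum residue masses and add one water for the termini.
one_letter = "".join(THREE_TO_ONE[code] for code in SEQUENCE_3.split("-"))
residue_count = len(one_letter)
average_mass = sum(AVG_RESIDUE_MASS[aa] for aa in one_letter) + WATER

print(one_letter)
print(residue_count, round(average_mass, 1))
```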
