Establishing the minimality of phylogenetic trees from protein sequences

Phylogenetic relationships within the family Halobacteriaceae inferred from rpoB′ gene and protein sequences

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.65190-0 ◽

2007 ◽

Vol 57 (10) ◽

pp. 2289-2295 ◽

Cited By ~ 33

Author(s):

Madalin Enache ◽

Takashi Itoh ◽

Tadamasa Fukushima ◽

Ron Usami ◽

Lucia Dumitru ◽

...

Keyword(s):

16S Rrna ◽

Molecular Marker ◽

16S Rrna Gene ◽

Phylogenetic Trees ◽

Protein Sequences ◽

Rpob Gene ◽

Rrna Gene ◽

Gene Sequences ◽

16S Rrna Gene Sequences ◽

The Family

In order to clarify the current phylogeny of the haloarchaea, particularly the closely related genera that have been difficult to sort out using 16S rRNA gene sequences, the DNA-dependent RNA polymerase subunit B′ gene (rpoB′) was used as a complementary molecular marker. Partial sequences of the gene were determined from 16 strains of the family Halobacteriaceae. Comparisons of phylogenetic trees inferred from the gene and protein sequences as well as from corresponding 16S rRNA gene sequences suggested that species of the genera Natrialba, Natronococcus, Halobiforma, Natronobacterium, Natronorubrum, Natrinema/Haloterrigena and Natronolimnobius formed a monophyletic group in all trees. In the RpoB′ protein tree, the alkaliphilic species Natrialba chahannaoensis, Natrialba hulunbeirensis and Natrialba magadii formed a tight group, while the neutrophilic species Natrialba asiatica formed a separate group with species of the genera Natronorubrum and Natronolimnobius. Species of the genus Natronorubrum were split into two groups in both the rpoB′ gene and protein trees. The most important advantage of the use of the rpoB′ gene over the 16S rRNA gene is that sequences of the former are highly conserved amongst species of the family Halobacteriaceae. All sequences determined so far can be aligned unambiguously without any gaps. On the other hand, gaps are necessary at 49 positions in the inner part of the alignment of 16S rRNA gene sequences. The rpoB′ gene and protein sequences can be used as an excellent alternative molecular marker in phylogenetic analysis of the Halobacteriaceae.

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
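As a rough illustration of what "shortest possible tree" means, the sketch below scores candidate topologies by the number of nucleotide changes they require and picks the smallest. It uses Fitch's small-parsimony counting, a standard technique that is not necessarily the procedure used in the paper; the four toy sequences, the taxon names and the function names `fitch_site` and `tree_length` are assumptions made for this example only.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes needed) for one alignment column.

    `tree` is either a taxon name (leaf) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change here.
        return common, left_changes + right_changes
    # The children disagree: one substitution is required at this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, seqs):
    """Total number of changes the tree requires, summed over all columns."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


# Toy aligned nucleotide sequences (hypothetical, not data from the paper).
seqs = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCTA"}

# The three possible groupings of four taxa.
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
lengths = [tree_length(t, seqs) for t in candidates]
best = candidates[lengths.index(min(lengths))]
print(lengths, best)
```

Enumerating every topology is only feasible for a handful of taxa, which is one reason a method that builds the tree taxon by taxon is attractive for larger data sets.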

Analysis of SARS-CoV-2 nucleocapsid protein sequence variations in ASEAN countries

Medical Journal of Indonesia ◽

10.13181/mji.oa.215304 ◽

2021 ◽

Author(s):

Mochammad Rajasa Mukti Negara ◽

Ita Krissanti ◽

Gita Widya Pradini

Keyword(s):

Phylogenetic Tree ◽

Phylogenetic Trees ◽

Protein Sequences ◽

Reference Sequence ◽

N Protein ◽

Asean Country ◽

Sequence Variations ◽

Complete Sequences ◽

Asean Countries ◽

Global Initiative

BACKGROUND Nucleocapsid (N) protein is one of the four structural proteins of SARS-CoV-2; it is known to be more conserved than the spike protein and is highly immunogenic. This study aimed to analyze the variation of the SARS-CoV-2 N protein sequences in ASEAN countries, including Indonesia. METHODS Complete sequences of SARS-CoV-2 N protein from each ASEAN country were obtained from the Global Initiative on Sharing All Influenza Data (GISAID), while the reference sequence was obtained from GenBank. All sequences collected from December 2019 to March 2021 were grouped into clades according to GISAID, and two representative isolates were chosen from each clade for the analysis. The sequences were aligned with MUSCLE, and phylogenetic trees were built using MEGA-X software based on the nucleotide and translated amino acid (AA) sequences. RESULTS A total of 98 isolates of complete N protein genes from ASEAN countries were analyzed. The nucleotides of all isolates were 97.5% conserved. Of 31 nucleotide changes, 22 led to AA substitutions; thus, the AA sequences were 94.5% conserved. The phylogenetic trees of the nucleotide and AA sequences show similar branches. Nucleotide variations in clade O (C28311T); clade GR (28881–28883 GGG>AAC); and clade GRY (28881–28883 GGG>AAC and C28977T) lead to specific branches corresponding to the clade within both trees. CONCLUSIONS The N protein sequences of SARS-CoV-2 across ASEAN countries are highly conserved. Most isolates were closely related to the reference sequence originating from China, except the isolates representing clades O, GR, and GRY, which formed specific branches in the phylogenetic tree.
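One plausible way to arrive at a "percent conserved" figure like those quoted above is to count the alignment columns in which every sequence carries the same residue. The sketch below does this for a toy alignment; the sequences and the function name `percent_conserved` are hypothetical, and the study's own calculation in MEGA-X may differ in detail.

```python
def percent_conserved(aligned):
    """Percentage of alignment columns in which all sequences are identical."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    conserved = sum(1 for column in zip(*aligned) if len(set(column)) == 1)
    return 100.0 * conserved / length


# Toy aligned nucleotide sequences (made up for illustration).
toy_alignment = [
    "ATGTCTGATA",
    "ATGTCTGATA",
    "ATGTCTAATA",  # differs at column 6
    "ATGACTGATA",  # differs at column 3
]
print(percent_conserved(toy_alignment))
```

The same function works on translated sequences, which is how nucleotide-level and amino-acid-level conservation can be compared side by side.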

Novel Molecular Synapomorphies Demarcate Different Main Groups/Subgroups of Plasmodium and Piroplasmida Species Clarifying Their Evolutionary Relationships

Genes ◽

10.3390/genes10070490 ◽

2019 ◽

Vol 10 (7) ◽

pp. 490 ◽

Cited By ~ 1

Author(s):

Sharma ◽

Gupta

Keyword(s):

Phylogenetic Analysis ◽

Phylogenetic Trees ◽

Plasmodium Species ◽

Protein Sequences ◽

Evolutionary Relationships ◽

Major Life ◽

Comparative Analyses ◽

Large Numbers ◽

Conserved Signature Indels ◽

Relationship Of

The class Hematozoa encompasses several clinically important genera, including Plasmodium, whose members cause the major life-threatening disease malaria. Hence, a good understanding of the interrelationships of organisms from this class and reliable means for distinguishing them are of much importance. This study reports comprehensive phylogenetic and comparative analyses of protein sequences from the genomes of 28 hematozoa species to understand their interrelationships. In addition to phylogenetic trees based on two large datasets of protein sequences, detailed comparative analyses were carried out on the genomes of hematozoa species to identify novel molecular synapomorphies consisting of conserved signature indels (CSIs) in protein sequences. These studies have identified 79 CSIs that are exclusively present in specific groups of Hematozoa/Plasmodium species, also supported by phylogenetic analysis, providing reliable means for the identification of these species groups and understanding their interrelationships. Of these CSIs, six CSIs are specifically shared by all hematozoa species, two CSIs serve to distinguish members of the order Piroplasmida, five CSIs are uniquely found in all Piroplasmida species except B. microti and two CSIs are specific for the genus Theileria. Additionally, we also describe 23 CSIs that are exclusively present in all genome-sequenced Plasmodium species and two, nine, ten and eight CSIs which are specific for members of the Plasmodium subgenera Haemamoeba, Laverania, Vinckeia and Plasmodium (excluding P. ovale and P. malariae), respectively. Additionally, our work has identified several CSIs that support species relationships which are not evident from phylogenetic analysis.
Of these CSIs, one CSI supports the ancestral nature of the avian-Plasmodium species in comparison to the mammalian-infecting groups of Plasmodium species, four CSIs strongly support a specific relationship of species between the subgenera Plasmodium and Vinckeia and three CSIs each reliably group P. malariae with members of the subgenus Plasmodium and P. ovale within the subgenus Vinckeia, respectively. These results provide a reliable framework for understanding the evolutionary relationships among the Plasmodium/Piroplasmida species. Further, in view of the exclusivity of the described molecular markers for the indicated groups of hematozoa species, particularly large numbers of unique characteristics that are specific for all Plasmodium species, they provide important molecular tools for biochemical/genetic studies and for developing novel diagnostics and therapeutics for these organisms.
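The core idea of a group-specific indel can be sketched as a scan over an alignment for columns where every member of a chosen group has a gap and no other sequence does (or the reverse). This is a simplified illustration only: the sequence names, the toy alignment and the function `group_specific_indel_columns` are hypothetical, and the sketch ignores other criteria a real CSI analysis would likely apply, such as conservation of the flanking regions.

```python
def group_specific_indel_columns(alignment, group):
    """Return column indices where the gap pattern separates `group` from the rest.

    `alignment` maps sequence names to equal-length aligned strings ('-' = gap).
    """
    inside = [name for name in alignment if name in group]
    outside = [name for name in alignment if name not in group]
    if not inside or not outside:
        raise ValueError("need sequences both inside and outside the group")
    length = len(alignment[inside[0]])
    hits = []
    for i in range(length):
        in_gaps = [alignment[name][i] == "-" for name in inside]
        out_gaps = [alignment[name][i] == "-" for name in outside]
        deletion = all(in_gaps) and not any(out_gaps)   # gap only in the group
        insertion = not any(in_gaps) and all(out_gaps)  # residue only in the group
        if deletion or insertion:
            hits.append(i)
    return hits


# Toy alignment (made-up sequences and names).
toy = {
    "in_1": "MKV--LTA",
    "in_2": "MKI--LTA",
    "out_1": "MKVGSLTA",
    "out_2": "MRVGALTA",
}
print(group_specific_indel_columns(toy, {"in_1", "in_2"}))
```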

Progressive Multiple Alignment of Protein Sequences and the Construction of Phylogenetic Trees

Computer Analysis of Sequence Data ◽

10.1385/0-89603-276-0:319 ◽

1994 ◽

pp. 319-325 ◽

Cited By ~ 2

Author(s):

Aiala Reizer ◽

Jonathan Reizer

Keyword(s):

Phylogenetic Trees ◽

Protein Sequences ◽

Multiple Alignment

A Phylogenetic Analysis for Gene ITPK1

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.466-467.27 ◽

2012 ◽

Vol 466-467 ◽

pp. 27-30

Author(s):

Kun Luo ◽

Dong Hui Luo

Keyword(s):

Phylogenetic Analysis ◽

Protein Sequence ◽

Selection Pressure ◽

Phylogenetic Trees ◽

Essential Role ◽

Protein Sequences

Inositol 1,3,4-trisphosphate 5/6 kinase (ITPK1) is a pivotal enzyme in producing IP6, a molecule that plays an essential role in many biochemical processes in mammalian cells. In this paper, two phylogenetic trees are constructed based on the mRNA sequences and the protein sequences, respectively. The results indicate that the protein sequences are more conserved than the mRNA sequences in primates. Although the ITPK1 domain is abundantly distributed in both plants and animals, the protein sequence varies greatly between them. The protein-based tree reflects an evolutionary order that is consistent with that of organismal evolution. A Z-test of selection indicates that the evolution of the ITPK1 protein is driven by selection pressure.

Phylogenomic analyses and molecular signatures for the class Halobacteria and its two major clades: a proposal for division of the class Halobacteria into an emended order Halobacteriales and two new orders, Haloferacales ord. nov. and Natrialbales ord. nov., containing the novel families Haloferacaceae fam. nov. and Natrialbaceae fam. nov.

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.070136-0 ◽

2015 ◽

Vol 65 (Pt_3) ◽

pp. 1050-1069 ◽

Cited By ~ 214

Author(s):

Radhey S. Gupta ◽

Sohail Naushad ◽

Sheridan Baker

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Phylogenetic Trees ◽

Protein Sequences ◽

Rrna Gene ◽

Molecular Signatures ◽

The Novel ◽

Gene Trees

The Halobacteria constitute one of the largest groups within the Archaea. The hierarchical relationship among members of this large class, which comprises a single order and a single family, has proven difficult to determine based upon 16S rRNA gene trees and morphological and physiological characteristics. This work reports detailed phylogenetic and comparative genomic studies on >100 halobacterial (haloarchaeal) genomes containing representatives from 30 genera to investigate their evolutionary relationships. In phylogenetic trees reconstructed on the basis of 32 conserved proteins, using both neighbour-joining and maximum-likelihood methods, two major clades (clades A and B) encompassing nearly two-thirds of the sequenced haloarchaeal species were strongly supported. Clades grouping the same species/genera were also supported by the 16S rRNA gene trees and trees for several individual highly conserved proteins (RpoC, EF-Tu, UvrD, GyrA, EF-2/EF-G). In parallel, our comparative analyses of protein sequences from haloarchaeal genomes have identified numerous discrete molecular markers in the form of conserved signature indels (CSIs) in protein sequences and conserved signature proteins (CSPs) that are found uniquely in specific groups of haloarchaea. Thirteen CSIs in proteins involved in diverse functions and 68 CSPs that are uniquely present in all or most genome-sequenced haloarchaea provide novel molecular means for distinguishing members of the class Halobacteria from all other prokaryotes. The members of clade A are distinguished from all other haloarchaea by the unique shared presence of two CSIs in the ribose operon protein and small GTP-binding protein and eight CSPs that are found specifically in members of this clade. Likewise, four CSIs in different proteins and five other CSPs are present uniquely in members of clade B and distinguish them from all other haloarchaea.
Based upon their specific clustering in phylogenetic trees for different gene/protein sequences and the unique shared presence of large numbers of molecular signatures, members of clades A and B are indicated to be distinct from all other haloarchaea because of their uniquely shared evolutionary histories. Based upon these results, it is proposed that clades A and B be recognized as two new orders, Natrialbales ord. nov. and Haloferacales ord. nov., within the class Halobacteria, containing the novel families Natrialbaceae fam. nov. and Haloferacaceae fam. nov. Other members of the class Halobacteria that are not members of these two orders will remain part of the emended order Halobacteriales in an emended family Halobacteriaceae.

Protein sequence comparison under a new complex representation of amino acids based on their physio-chemical properties

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i1.9292 ◽

2018 ◽

Vol 7 (1.8) ◽

pp. 181

Author(s):

Jayanta Pal ◽

Soumen Ghosh ◽

Bansibadan Maji ◽

Dilip Kumar Bhattacharya

Keyword(s):

Amino Acids ◽

Protein Sequence ◽

Sequence Comparison ◽

Phylogenetic Trees ◽

Protein Sequences ◽

Complex Representation ◽

Hydrophobic Property ◽

The Real ◽

Pair Wise Comparison ◽

Protein Sequence Comparison

The paper first considers a new complex representation of amino acids in which the real and imaginary parts are taken from the hydrophilic properties and residue volumes of the amino acids, respectively. It then applies a complex Fourier transform to the resulting sequence of complex numbers to obtain the spectrum in the frequency domain. Using the method of ‘inter coefficient distances’ on this spectrum, it constructs phylogenetic trees of different protein sequences. Finally, pairwise comparisons of these protein sequences are made on the basis of the phylogenetic trees. The paper also obtains pairwise comparisons of the same protein sequences following the same method but based on a known complex representation of amino acids, in which the real and imaginary parts refer to hydrophobicity properties and residue volumes of the amino acids, respectively. The results of the two methods are then compared with those obtained earlier for the same sequences by other methods. Both methods are found to be workable, and the new complex representation performs better than the earlier one. This shows that the hydrophilic property (polarity) is a better choice than the hydrophobic property of amino acids, especially in protein sequence comparison.
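The pipeline described above (complex encoding, Fourier transform, distance between spectra) can be sketched as follows. Everything specific here is an assumption: the numbers in `TOY_SCALE` are placeholders rather than real hydrophilicity or volume values, and a plain Euclidean distance between magnitude spectra stands in for the paper's ‘inter coefficient distances’ method, which is not reproduced.

```python
import cmath
import math

# Placeholder complex codes: real part ~ a hydrophilicity-like value,
# imaginary part ~ a volume-like value. These are NOT published scales.
TOY_SCALE = {
    "A": complex(0.2, 0.9),
    "G": complex(0.1, 0.6),
    "K": complex(0.9, 1.7),
    "L": complex(-0.8, 1.7),
    "D": complex(0.9, 1.1),
}


def encode(seq):
    """Map each residue to its complex code."""
    return [TOY_SCALE[residue] for residue in seq]


def dft(values):
    """Naive discrete Fourier transform of a list of complex numbers."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def magnitude_spectrum(seq):
    """Magnitudes of the Fourier coefficients of the encoded sequence."""
    return [abs(coefficient) for coefficient in dft(encode(seq))]


def spectral_distance(seq_a, seq_b):
    """Euclidean distance between magnitude spectra (a stand-in metric).

    Spectra of unequal length are truncated to the shorter one, which is a
    simplification chosen for this sketch.
    """
    spec_a, spec_b = magnitude_spectrum(seq_a), magnitude_spectrum(seq_b)
    m = min(len(spec_a), len(spec_b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(spec_a[:m], spec_b[:m])))


print(spectral_distance("AGKL", "AGKD"))
```

A matrix of such pairwise distances could then be handed to any distance-based tree-building method.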

New Simian Immunodeficiency Virus Infecting De Brazza's Monkeys (Cercopithecus neglectus): Evidence for a Cercopithecus Monkey Virus Clade

Journal of Virology ◽

10.1128/jvi.78.14.7748-7762.2004 ◽

2004 ◽

Vol 78 (14) ◽

pp. 7748-7762 ◽

Cited By ~ 78

Author(s):

Frederic Bibollet-Ruche ◽

Elizabeth Bailes ◽

Feng Gao ◽

Xavier Pourrut ◽

Katrina L. Barlow ◽

...

Keyword(s):

Phylogenetic Trees ◽

Protein Sequences ◽

Sub Saharan Africa ◽

Primate Species ◽

Gag Protein ◽

Virus Group ◽

In The Wild ◽

Complete Sequences ◽

Sub Saharan ◽

Cercopithecus Neglectus

ABSTRACT Nearly complete sequences of simian immunodeficiency viruses (SIVs) infecting 18 different nonhuman primate species in sub-Saharan Africa have now been reported; yet, our understanding of the origins, evolutionary history, and geographic distribution of these viruses still remains fragmentary. Here, we report the molecular characterization of a lentivirus (SIVdeb) naturally infecting De Brazza's monkeys (Cercopithecus neglectus). Complete SIVdeb genomes (9,158 and 9,227 bp in length) were amplified from uncultured blood mononuclear cell DNA of two wild-caught De Brazza's monkeys from Cameroon. In addition, partial pol sequences (650 bp) were amplified from four offspring of De Brazza's monkeys originally caught in the wild in Uganda. Full-length (9,068 bp) and partial pol (650 bp) SIVsyk sequences were also amplified from Sykes's monkeys (Cercopithecus albogularis) from Kenya. Analysis of these sequences identified a new SIV clade (SIVdeb), which differed from previously characterized SIVs at 40 to 50% of sites in Pol protein sequences. The viruses most closely related to SIVdeb were SIVsyk and members of the SIVgsn/SIVmus/SIVmon group of viruses infecting greater spot-nosed monkeys (Cercopithecus nictitans), mustached monkeys (Cercopithecus cephus), and mona monkeys (Cercopithecus mona), respectively. In phylogenetic trees of concatenated protein sequences, SIVdeb, SIVsyk, and SIVgsn/SIVmus/SIVmon clustered together, and this relationship was highly significant in all major coding regions. Members of this virus group also shared the same number of cysteine residues in their extracellular envelope glycoprotein and a high-affinity AIP1 binding site (YPD/SL) in their p6 Gag protein, as well as a unique transactivation response element in their viral long terminal repeat; however, SIVdeb and SIVsyk, unlike SIVgsn, SIVmon, and SIVmus, did not encode a vpu gene. 
These data indicate that De Brazza's monkeys are naturally infected with SIVdeb, that this infection is prevalent in different areas of the species' habitat, and that geographically diverse SIVdeb strains cluster in a single virus group. The consistent clustering of SIVdeb with SIVsyk and the SIVmon/SIVmus/SIVgsn group also suggests that these viruses have evolved from a common ancestor that likely infected a Cercopithecus host in the distant past. The vpu gene appears to have been acquired by a subset of these Cercopithecus viruses after the divergence of SIVdeb and SIVsyk.

Determining a novel feature-space for SARS-CoV-2 sequence data

10.37044/osf.io/xt7gw ◽

2020 ◽

Author(s):

Francesco Ballesio ◽

Ali Haider Bangash ◽

Didier Barradas-Bautista ◽

Justin Barton ◽

Andrea Guarracino ◽

...

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Mhc Class I ◽

Phylogenetic Trees ◽

Data Science ◽

Sequence Data ◽

Protein Sequences ◽

Feature Space ◽

Future Research ◽

Alignment Free

The pandemic nature of SARS-CoV-2 and its ability to reinfect a recovered subject, among its other damaging characteristics, took everybody by surprise. A global collaborative scientific effort was urgently required to bring experts from different areas of medicine and data science together. Such a platform was provided by the COVID19 Virtual BioHackathon, organized from the 5th to the 11th of April, 2020, to work on related pressing issues ranging from text mining to genomics. Under the "Machine learning" track, we determined the optimal k-mer length for feature extraction, constructed continuous distributed representations for protein sequences to create phylogenetic trees in an alignment-free manner, and clustered predicted MHC class I and II binding affinity to aid in vaccine design. All the related work is available in a GitHub repository under an MIT license for future research.
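A minimal sketch of alignment-free comparison with k-mers is given below. It uses raw k-mer counts and cosine distance, which is a much simpler stand-in for the continuous distributed representations mentioned in the abstract; the function names, the choice of k and the toy sequences are assumptions for illustration and are not taken from the project's repository.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in `seq`."""
    if k <= 0 or k > len(seq):
        raise ValueError("k must be between 1 and the sequence length")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(counts_a, counts_b):
    """1 - cosine similarity between two sparse k-mer count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[kmer] * counts_b[kmer] for kmer in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Toy protein fragments (made up); no alignment step is needed.
first = kmer_counts("MKVMKVLTA", 3)
second = kmer_counts("MKVMKILTA", 3)
print(cosine_distance(first, second))
```

Pairwise distances of this kind can be collected into a matrix and passed to a distance-based tree builder, and repeating the comparison for several values of k is one simple way to explore which k-mer length separates the sequences best.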