Total Ortholog Median Matrix as an alternative unsupervised approach for phylogenomics based on evolutionary distance between protein coding genes

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sandra Regina Maruyama ◽  
Luana Aparecida Rogerio ◽  
Patricia Domingues Freitas ◽  
Marta Maria Geraldes Teixeira ◽  
José Marcos Chaves Ribeiro

Abstract: The increasing amount of available genomic data has allowed the development of phylogenomic analytical tools. Current methods compile information from single gene phylogenies, whether based on topologies or multiple sequence alignments. Generally, phylogenomic analyses select gene families or genomic regions to construct phylogenomic trees. Here, we present an alternative approach for phylogenomics, named TOMM (Total Ortholog Median Matrix), to construct a representative phylogram composed of amino acid distance measures of all pairwise ortholog protein sequence pairs from desired species inside a group of organisms. The procedure is divided into two main steps: (1) ortholog detection and (2) creation of a matrix with the median amino acid distance measures of all pairwise orthologous sequences. We tested this approach within three different groups of organisms: Kinetoplastida protozoa, hematophagous Diptera vectors and Primates. Our approach was robust and effective in reconstructing the phylogenetic relationships for the three groups. Moreover, novel branch topologies could be achieved, providing insights into the phylogenetic relationships between some taxa.
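The second step described above (a matrix of median distances over all orthologous pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `median_distance_matrix`, the input layout, and the toy distances are all hypothetical, and ortholog detection and the distance calculation itself are assumed to have been done beforehand.

```python
from itertools import combinations
from statistics import median


def median_distance_matrix(ortholog_distances, species):
    """Build a symmetric species-by-species matrix of median distances.

    ortholog_distances (hypothetical layout) maps an ortholog group id to
    {(species_a, species_b): amino acid distance} for that group.
    """
    index = {name: i for i, name in enumerate(species)}
    n = len(species)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays at 0.0
    for a, b in combinations(species, 2):
        values = []
        for pair_distances in ortholog_distances.values():
            # Accept either key order for a species pair.
            d = pair_distances.get((a, b), pair_distances.get((b, a)))
            if d is not None:
                values.append(d)
        # Pairs that share no ortholog are marked as NaN rather than 0.
        m = median(values) if values else float("nan")
        matrix[index[a]][index[b]] = m
        matrix[index[b]][index[a]] = m
    return matrix


# Toy data: three ortholog groups, one of which lacks a sequence for sp2/sp3.
species = ["sp1", "sp2", "sp3"]
toy = {
    "og1": {("sp1", "sp2"): 0.10, ("sp1", "sp3"): 0.40, ("sp2", "sp3"): 0.35},
    "og2": {("sp1", "sp2"): 0.20, ("sp1", "sp3"): 0.50, ("sp2", "sp3"): 0.45},
    "og3": {("sp1", "sp2"): 0.90, ("sp1", "sp3"): 0.60},
}
tomm = median_distance_matrix(toy, species)
```

Using the median rather than the mean keeps a single fast-evolving ortholog (such as `og3` for the sp1/sp2 pair) from dominating the entry. The resulting matrix could then be passed to any distance-based tree-building method.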

2020 ◽  
Author(s):  
David Roe ◽  
Cynthia Vierra-Green ◽  
Chul-Woo Pyo ◽  
Daniel E. Geraghty ◽  
Stephen R. Spellman ◽  
...  

Abstract: Human chromosome 19q13.4 contains genes encoding killer-cell immunoglobulin-like receptors (KIR). Reported haplotype lengths range from 67 to 269 kilobases and contain 4 to 18 genes. The region has certain properties such as single nucleotide variation, structural variation, homology, and repetitive elements that make it hard to align accurately beyond single gene alleles. To the best of our knowledge, a multiple sequence alignment of KIR haplotypes has never been published or presented. Such an alignment would be useful to precisely define KIR haplotypes and loci, provide context for assigning alleles (especially fusion alleles) to genes, infer evolutionary history, impute alleles, interpret and predict co-expression, and generate markers. In order to extend the framework of KIR haplotype sequences in the human genome reference, 27 new sequences were generated including 24 haplotypes from 12 individuals of African American ancestry that were selected for genotypic diversity and novelty to the reference, to bring the total to 68 full length genomic KIR haplotype sequences. We leveraged these data and tools from our long-read KIR haplotype assembly algorithm to define and align KIR haplotypes at <5 kb resolution on average. We then used a standard alignment algorithm to refine that alignment down to single base resolution. This processing demonstrated that the high-level alignment recapitulates human-curated annotation of the human haplotypes as well as a chimpanzee haplotype. Further, assignments and alignments of gene alleles were consistent with their human curation in haplotype and allele databases. These results define KIR haplotypes as 14 loci containing 9 genes. The multiple sequence alignments have been applied in two software packages as probes to capture and annotate KIR haplotypes and as markers to genotype KIR from WGS.


Plants ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 456 ◽  
Author(s):  
Cornelius M. Kyalo ◽  
Zhi-Zhong Li ◽  
Elijah M. Mkala ◽  
Itambo Malombe ◽  
Guang-Wan Hu ◽  
...  

Streptocarpus ionanthus (Gesneriaceae) comprises nine herbaceous subspecies, endemic to Kenya and Tanzania. The evolution of Str. ionanthus is perceived as complex due to morphological heterogeneity and unresolved phylogenetic relationships. Our study seeks to understand the molecular variation within Str. ionanthus using a phylogenomic approach. We sequence the chloroplast genomes of five subspecies of Str. ionanthus, compare their structural features and identify divergent regions. The five genomes are identical, with a conserved structure, a narrow size range (170 base pairs (bp)) and 115 unique genes (80 protein-coding, 31 tRNAs and 4 rRNAs). Genome alignment exhibits high synteny while the number of Simple Sequence Repeats (SSRs) is observed to be low (varying from 37 to 41), indicating high similarity. We identify ten divergent regions, including five variable regions (psbM, rps3, atpF-atpH, psbC-psbZ and psaA-ycf3) and five genes with a high number of polymorphic sites (rps16, rpoC2, rpoB, ycf1 and ndhA) which could be investigated further for phylogenetic utility in Str. ionanthus. Phylogenomic analyses here exhibit low polymorphism within Str. ionanthus and poor phylogenetic separation, which might be attributed to recent divergence. The complete chloroplast genome sequence data for the five subspecies provide genomic resources which can be expanded for future elucidation of Str. ionanthus phylogenetic relationships.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3142 ◽  
Author(s):  
Kae Yi Tan ◽  
Choo Hock Tan ◽  
Lawan Chanhome ◽  
Nget Hong Tan

Background: The monocled cobra (Naja kaouthia) is a medically important venomous snake in Southeast Asia. Its venom has been shown to vary geographically in relation to venom composition and neurotoxic activity, indicating vast diversity of the toxin genes within the species. To investigate the polygenic trait of the venom and its locale-specific variation, we profiled and compared the venom gland transcriptomes of N. kaouthia from Malaysia (NK-M) and Thailand (NK-T) applying next-generation sequencing (NGS) technology. Methods: The transcriptomes were sequenced on the Illumina HiSeq platform, assembled and followed by transcript clustering and annotations for gene expression and function. Pairwise or multiple sequence alignments were conducted on the toxin genes expressed. Substitution rates were studied for the major toxins co-expressed in NK-M and NK-T. Results and discussion: The toxin transcripts showed high redundancy (41–82% of the total mRNA expression) and comprised 23 gene families expressed in NK-M and NK-T, respectively (22 gene families were co-expressed). Among the venom genes, three-finger toxins (3FTxs) predominated in the expression, with multiple sequences noted. Comparative analysis and selection study revealed that 3FTxs are genetically conserved between the geographical specimens whilst demonstrating distinct differential expression patterns, implying gene up-regulation for selected principal toxins, or alternatively, enhanced transcript degradation or lack of transcription of certain traits. One of the striking features that elucidates the inter-geographical venom variation is the up-regulation of α-neurotoxins (constituting ∼80.0% of toxin fragments per kilobase of exon model per million mapped reads (FPKM)), particularly the long-chain α-elapitoxin-Nk2a (48.3%) in NK-T, whereas only 1.7% was noted in NK-M. Instead, short neurotoxin isoforms were up-regulated in NK-M (46.4%). Another distinct transcriptional pattern observed is the exclusively and abundantly expressed cytotoxin CTX-3 in NK-T. The findings suggested correlation with the geographical variation in proteome and toxicity of the venom, and support the call for optimising antivenom production and use in the region. In addition, the current study uncovered full and partial sequences of numerous toxin genes from N. kaouthia which have not been reported hitherto; these include N. kaouthia-specific L-amino acid oxidase (LAAO), snake venom serine protease (SVSP), cystatin, acetylcholinesterase (AChE), hyaluronidase (HYA), waprin, phospholipase B (PLB), aminopeptidase (AP), neprilysin, etc. Taken together, the findings further enrich the snake toxin database and provide deeper insights into the genetic diversity of cobra venom toxins.
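The expression percentages above are shares of FPKM. As a minimal sketch of that arithmetic, using the standard FPKM definition (fragments, scaled per kilobase of transcript and per million mapped fragments), the calculation might look like the following. The function names and all numbers are illustrative assumptions and are not taken from the study.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon model per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def percent_share(values):
    """Express each value as a percentage of the total (e.g. of toxin FPKM)."""
    total = sum(values.values())
    return {name: 100.0 * v / total for name, v in values.items()}


# Made-up counts for two hypothetical toxin transcripts in one library.
library_size = 10_000_000
toy_fpkm = {
    "toxin_a": fpkm(500, 1_000, library_size),  # 500 fragments on a 1 kb transcript
    "toxin_b": fpkm(300, 2_000, library_size),  # 300 fragments on a 2 kb transcript
}
shares = percent_share(toy_fpkm)
```

Normalising by transcript length matters here: `toxin_b` collects more than half as many fragments as `toxin_a`, but because it is twice as long its FPKM, and therefore its share, is proportionally lower.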


2019 ◽  
Author(s):  
Sophie I Holland ◽  
Richard J Edwards ◽  
Haluk Ertan ◽  
Yie Kuan Wong ◽  
Tonia L Russell ◽  
...  

Bacteria capable of dechlorinating the toxic environmental contaminant dichloromethane (DCM, CH2Cl2) are of great interest for potential bioremediation applications. A novel, strictly anaerobic, DCM-fermenting bacterium, "DCMF", was enriched from organochlorine-contaminated groundwater near Botany Bay, Australia. The enrichment culture was maintained in minimal, mineral salt medium amended with dichloromethane as the sole energy source. PacBio whole genome SMRT sequencing of DCMF allowed de novo, gap-free assembly despite the presence of cohabiting organisms in the culture. Illumina sequencing reads were utilised to correct minor indels. The single, circularised 6.44 Mb chromosome was annotated with the IMG pipeline and contains 5,773 predicted protein-coding genes. Based on 16S rRNA gene and predicted proteome phylogeny, the organism appears to be a novel member of the Peptococcaceae family. The DCMF genome is large in comparison to known DCM-fermenting bacteria and includes 96 predicted methylamine methyltransferases, which may provide clues to the basis of its DCM metabolism. Full annotation has been provided in a custom genome browser and search tool, in addition to multiple sequence alignments and phylogenetic trees for every predicted protein, available at http://www.slimsuite.unsw.edu.au/research/dcmf/.


2015 ◽  
Author(s):  
Xiaolong Wang ◽  
Chao Yang

Multiple sequence alignment (MSA) is widely used to reveal structural and functional changes leading to genetic differences among species, and to reconstruct evolutionary histories of related genes, proteins and genomes. Traditionally, proteins and their coding sequences (CDSs) are aligned and analyzed separately, but drastically different conclusions are often drawn from the same set of data. Here we present a new alignment strategy, Codon and Amino Acid Unified Sequence Alignment (CAUSA) 2.0, which aligns proteins and their coding sequences simultaneously. CAUSA 2.0 efficiently optimizes the alignment of CDSs at both the codon and amino acid levels. Theoretical analysis showed that CAUSA 2.0 enhances the entropy information content of MSA. Empirical data analysis demonstrated that CAUSA 2.0 is more accurate and consistent than nucleotide, protein or codon level alignments. CAUSA 2.0 locates in-frame indels more accurately, makes the alignment of coding sequences biologically more significant, and reveals several novel mutation mechanisms that relate to some genetic diseases. CAUSA 2.0 is available at www.DNAPlusPro.com.


2020 ◽  
Author(s):  
Dustin J. Wcisel ◽  
J. Thomas Howard ◽  
Jeffrey A. Yoder ◽  
Alex Dornburg

Background: Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results: We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions: TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.
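The idea of reassembling alignments under a user-defined missing data threshold can be illustrated generically. The sketch below is not TOAST's API (TOAST is an R package, and none of its functions are reproduced here); the function name `filter_loci_by_missing_data`, the data layout, and the toy sequences are hypothetical. It only shows the filtering logic: keep a locus when the fraction of taxa lacking a sequence for it does not exceed the chosen level.

```python
def filter_loci_by_missing_data(loci, taxa, max_missing_fraction):
    """Keep loci whose fraction of taxa without a sequence is within the limit.

    loci (hypothetical layout) maps a locus id to {taxon: sequence}; a taxon
    that is absent from the inner dict, or has an empty sequence, is missing.
    """
    kept = {}
    for locus, sequences in loci.items():
        missing = sum(1 for taxon in taxa if not sequences.get(taxon))
        if missing / len(taxa) <= max_missing_fraction:
            kept[locus] = sequences
    return kept


taxa = ["t1", "t2", "t3", "t4"]
toy_loci = {
    "locusA": {"t1": "MKV", "t2": "MKI", "t3": "MKV", "t4": "MRV"},  # complete
    "locusB": {"t1": "MAL", "t2": "MAL", "t3": "MSL"},               # 1 of 4 missing
    "locusC": {"t1": "MTT"},                                         # 3 of 4 missing
}
kept = filter_loci_by_missing_data(toy_loci, taxa, max_missing_fraction=0.25)
```

Rerunning the filter with different thresholds gives a simple way to compare how much data is retained at each acceptable missing data level before concatenation.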


2019 ◽  
Author(s):  
Alex Dornburg ◽  
Dustin J. Wcisel ◽  
J. Thomas Howard ◽  
Jeffrey A. Yoder

Background: Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results: We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions: TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.


Author(s):  
Ian R. Humphreys ◽  
Jimin Pei ◽  
Minkyung Baek ◽  
Aditya Krishnakumar ◽  
Ivan Anishchenko ◽  
...  

Abstract: Protein-protein interactions play critical roles in biology, but despite decades of effort, the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions that have not yet been identified. Here, we take advantage of recent advances in proteome-wide amino acid coevolution analysis and deep-learning-based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes, as represented within the Saccharomyces cerevisiae proteome. We use a combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence alignments for 8.3 million pairs of S. cerevisiae proteins and build models for strongly predicted protein assemblies with two to five components. Comparison to existing interaction and structural data suggests that these predictions are likely to be quite accurate. We provide structure models spanning almost all key processes in eukaryotic cells for 104 protein assemblies which have not been previously identified, and 608 which have not been structurally characterized. One-sentence summary: We take advantage of recent advances in proteome-wide amino acid coevolution analysis and deep-learning-based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes.


2010 ◽  
Vol 08 (05) ◽  
pp. 809-823 ◽  
Author(s):  
FREDRIK JOHANSSON ◽  
HIROYUKI TOH

The Shannon entropy is a common way of measuring conservation of sites in multiple sequence alignments, and has also been extended with the relative Shannon entropy to account for background frequencies. The von Neumann entropy is another extension of the Shannon entropy, adapted from quantum mechanics in order to account for amino acid similarities. However, no relative von Neumann entropy has yet been defined for sequence analysis. We introduce a new definition of the von Neumann entropy for use in sequence analysis, which we found to perform better than the previous definition. We also introduce the relative von Neumann entropy and a way of parametrizing this in order to obtain the Shannon entropy, the relative Shannon entropy and the von Neumann entropy at special parameter values. We performed an exhaustive search of this parameter space and found better predictions of catalytic sites compared to any of the previously used entropies.
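For orientation, the two baseline measures mentioned above can be sketched for a single alignment column. This is a minimal illustration of the standard Shannon entropy and relative Shannon entropy (Kullback-Leibler divergence from background frequencies) only; it does not implement the von Neumann entropies introduced in the paper. The function names are hypothetical, and the sketch assumes a gap-free column and a background that assigns a non-zero frequency to every residue present.

```python
from collections import Counter
from math import log2


def shannon_entropy(column):
    """Shannon entropy (bits) of the residue frequencies in one column."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


def relative_entropy(column, background):
    """Relative Shannon entropy (bits) of a column against background frequencies."""
    n = len(column)
    return sum(
        (c / n) * log2((c / n) / background[residue])
        for residue, c in Counter(column).items()
    )


# Toy columns over a two-letter alphabet with a uniform background.
background = {"A": 0.5, "C": 0.5}
conserved = "AAAA"  # fully conserved site
mixed = "AACC"      # evenly split site
```

A fully conserved column has zero Shannon entropy, while its relative entropy is high because the observed frequencies differ most from the background; an evenly split column behaves the opposite way under a uniform background.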


1996 ◽  
Vol 7 (2) ◽  
pp. 233-243 ◽  
Author(s):  
J O'Brien ◽  
M R al-Ubaidi ◽  
H Ripps

We have used low stringency hybridization to clone a novel connexin from a skate retinal cDNA library. A rat connexin 32 clone was used to isolate a single partial clone that was subsequently used to isolate seven more overlapping clones of the same cDNA. Two clones containing the entire open reading frame have a consensus sequence of 1456 bp and predict a protein of 302 amino acids in length with a molecular mass of 35,044 daltons, referred to as connexin 35 or Cx35. Southern blot analysis suggests that the cloned sequence lies in a single gene with one intron. Polymerase chain reaction amplification from genomic DNA and partial sequencing of this intron showed that it was approximately 950 bp in length, and located within the coding region 71 bp after the translation start site. Hydropathy analysis of the predicted protein and alignments with previously cloned connexins indicate that Cx35 has a long cytoplasmic loop and a relatively short carboxyl terminal tail. Multiple sequence alignments show that Cx35 has similarities to both alpha and beta groups of connexins and suggest that its origins may be near the divergence point for the two groups. Consensus sequences consistent with sites for phosphorylation by protein kinase C and by cAMP- or cGMP-dependent protein kinase were identified. Two transcripts were detected in Northern blot analysis: a 1.95-kb primary transcript and a 4.6-kb minor transcript. In RNA samples from 10 tissues, transcripts were detected only in the retina.

