An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

Alignment-based methods are at a disadvantage in sequence comparison and phylogeny reconstruction because of their high computational cost in both time and space. Alignment-free methods, by contrast, incur low computational cost and have recently gained popularity in bioinformatics. Here we propose a new alignment-free method for phylogenetic tree reconstruction based on whole genome sequences. A key component is a measure called the information-entropy position-weighted k-mer relative measure (IEPWRMkmer), which combines the position-weighted measure of k-mers proposed by our group with the information entropy of k-mer frequencies. The Manhattan distance is used to calculate the pairwise distance between species. Finally, we use the Neighbor-Joining method to construct the phylogenetic tree. To evaluate the performance of this method, we perform phylogenetic analysis on two datasets used by other researchers. The results demonstrate that the IEPWRMkmer method is efficient and reliable. The source code of our method is provided at https://github.com/wuyaoqun37/IEPWRMkmer.
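The ingredients named in this abstract (k-mer frequencies, their information entropy, and a Manhattan distance between species) can be illustrated with a minimal Python sketch. This is not the IEPWRMkmer measure itself: the position weighting and the exact way the entropy enters the measure are not described here, so the function names `kmer_frequencies`, `shannon_entropy`, and `manhattan` are hypothetical and only show the generic building blocks.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers in seq, in a fixed order.

    Assumption: overlapping k-mers, alphabet ACGT; k-mers containing other
    symbols are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def shannon_entropy(freqs):
    """Shannon entropy (in bits) of a frequency vector."""
    return -sum(p * log2(p) for p in freqs if p > 0)


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

A pairwise distance matrix built with `manhattan` over such vectors could then be passed to any Neighbor-Joining implementation to obtain a tree.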


Phylogenetic Tree Construction Using K-Mer Forest-Based Distance Calculation

International Journal of Online and Biomedical Engineering (iJOE) ◽

10.3991/ijoe.v16i07.13807 ◽

2020 ◽

Vol 16 (07) ◽

pp. 4 ◽

Cited By ~ 1

Author(s):

Gihan Gamage ◽

Nadeeshan Gimhana ◽

Indika Perera ◽

Shanaka Bandara ◽

Thilina Pathirana ◽

...

Keyword(s):

Phylogenetic Tree ◽

Dna Sequences ◽

Phylogenetic Trees ◽

Genetic Relatedness ◽

Biological Information ◽

Pairwise Distance ◽

Phylogenetic Tree Construction ◽

Distance Calculation ◽

Alignment Free ◽

Tree Construction

Phylogenetics is one of the dominant data engineering research disciplines based on biological information. Here we consider raw DNA sequences and perform comparative analysis in order to draw conclusions about them. The phylogenetic tree is a concise way to represent evolutionary relationships among different organisms. The elementary step in constructing a phylogenetic tree is to calculate the genetic distance among species. Alignment-based and alignment-free sequence comparison are the two main distance computation approaches used to find the genetic relatedness of different species. In this paper we propose a novel alignment-free, pairwise distance calculation method based on k-mers and a state-of-the-art, machine learning-based phylogenetic tree construction mechanism. With the proposed approach we can convert long DNA sequences into compact k-mer forests, which improves the efficiency of comparison. We then construct the phylogenetic tree from the calculated distances with an algorithm built upon k-medoid clustering, which showed significant efficiency and accuracy compared to traditional phylogenetic tree construction methods.


SANS serif: alignment-free, whole-genome based phylogenetic reconstruction

10.1101/2020.12.31.424643 ◽

2021 ◽

Author(s):

Andreas Rempel ◽

Roland Wittler

Keyword(s):

Phylogenetic Tree ◽

Source Code ◽

Phylogenetic Reconstruction ◽

Whole Genome ◽

Link Type ◽

Alignment Free ◽

Phylogeny Estimation

Summary: SANS serif is a novel software for alignment-free, whole-genome based phylogeny estimation that follows a pangenomic approach to efficiently calculate a set of splits in a phylogenetic tree or network. Availability and Implementation: Implemented in C++ and supported on Linux, MacOS, and Windows. The source code is freely available for download at https://gitlab.ub.uni-bielefeld.de/gi/[email protected]


An Alignment-free Heuristic for Fast Sequence Comparisons with Applications to Phylogeny Reconstruction

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '18 ◽

10.1145/3233547.3233648 ◽

2018 ◽

Author(s):

Jodh Pannu ◽

Sriram P. Chockalingam ◽

Sharma V. Thankachan ◽

Srinivas Aluru

Keyword(s):

Phylogeny Reconstruction ◽

Sequence Comparisons ◽

Alignment Free


Whole-Genome k-mer Topic Modeling Associates Bacterial Families

Genes ◽

10.3390/genes11020197 ◽

2020 ◽

Vol 11 (2) ◽

pp. 197

Author(s):

Ernesto Borrayo ◽

Isaias May-Canche ◽

Omar Paredes ◽

J. Alejandro Morales ◽

Rebeca Romo-Vázquez ◽

...

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Hierarchical Classification ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sequence Comparisons ◽

Alignment Free ◽

Biological Phenomena ◽

Topic Distribution ◽

Genome Comparisons

Alignment-free k-mer-based algorithms in whole genome sequence comparisons remain an ongoing challenge. Here, we explore the possibility of using Topic Modeling for organism whole-genome comparisons. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was considered as a document and 13-mer nucleotide representations as words. Latent Dirichlet allocation was used as the probabilistic modeling of the corpus. We were able to identify the topic distribution among the analyzed genomes, which is highly consistent with traditional hierarchical classification. It is possible that topic modeling may be applied to establish relationships between genome composition and biological phenomena.
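The "genome as document, 13-mers as words" representation can be sketched in a few lines of Python. This is only an illustration of the corpus-building step under stated assumptions: the helper names `genome_to_words` and `document_term_matrix` are hypothetical, and overlapping k-mers are assumed because the abstract does not say how the words were extracted. The resulting count matrix is the kind of input a latent Dirichlet allocation implementation would typically consume.

```python
from collections import Counter


def genome_to_words(genome, k=13):
    """Split a genome string into overlapping k-mer 'words' (assumed scheme)."""
    genome = genome.upper()
    return [genome[i:i + k] for i in range(len(genome) - k + 1)]


def document_term_matrix(genomes, k=13):
    """Build a vocabulary and a per-genome word-count matrix.

    Each row corresponds to one genome (document); each column to one
    k-mer (word) in the sorted vocabulary.
    """
    docs = [Counter(genome_to_words(g, k)) for g in genomes]
    vocab = sorted(set().union(*docs))
    matrix = [[doc[word] for word in vocab] for doc in docs]
    return vocab, matrix
```

For real genomes a sparse matrix would be needed, since the vocabulary of 13-mers can be very large.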


Whole-proteome tree of life suggests a deep burst of organism diversity

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1915766117 ◽

2020 ◽

Vol 117 (7) ◽

pp. 3678-3686 ◽

Cited By ~ 5

Author(s):

JaeJin Choi ◽

Sung-Hou Kim

Keyword(s):

Information Theory ◽

Genome Sequence ◽

Tree Of Life ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequences ◽

Alignment Free ◽

Whole Transcriptome ◽

Evolutionary Progression ◽

Feature Frequency

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of whole genome, of whole transcriptome, and of whole proteome. Of the three, whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying information theory-based feature frequency profile method, an “alignment-free” method, gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences on grouping and kinship among the groups at deep evolutionary levels. The ToL reveals 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; 3) all of the founders of the groups have emerged in a “deep burst” at the very beginning period near the root of the ToL—an explosive birth of life’s diversity.


Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0045 ◽

2019 ◽

Vol 18 (2) ◽

Author(s):

Hsin-Hsiung Huang ◽

Senthil Balaji Girimurugan

Keyword(s):

Discriminant Analysis ◽

Wavelet Packet ◽

Wavelet Packet Transform ◽

Discrete Wavelet ◽

Whole Genome ◽

Statistical Classification ◽

Genome Sequences ◽

Discrete Wavelet Packet Transform ◽

Alignment Free ◽

Free Representation

In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.


2011 German Escherichia coli outbreak: Alignment-free whole-genome phylogeny by feature frequency profiles

Nature Precedings ◽

10.1038/npre.2011.6109.1 ◽

2011 ◽

Author(s):

Man Kit Cheung ◽

Lei LI ◽

Wenyan Nong ◽

Hoi Shan Kwan

Keyword(s):

Escherichia Coli ◽

Whole Genome ◽

Alignment Free ◽

Genome Phylogeny ◽

Feature Frequency


Alignment-free Whole Genome Comparison Using k-mer Forests

2019 19th International Conference on Advances in ICT for Emerging Regions (ICTer) ◽

10.1109/icter48817.2019.9023714 ◽

2019 ◽

Cited By ~ 1

Author(s):

G. Gamage ◽

N. Gimhana ◽

A. Wickramarachchi ◽

V. Mallawaarachchi ◽

I. Perera

Keyword(s):

Genome Comparison ◽

Whole Genome ◽

Alignment Free ◽

Whole Genome Comparison


An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

BMC Bioinformatics ◽

10.1186/s12859-020-03738-5 ◽

2020 ◽

Vol 21 (S6) ◽

Author(s):

Sriram P. Chockalingam ◽

Jodh Pannu ◽

Sahar Hooshmand ◽

Sharma V. Thankachan ◽

Srinivas Aluru

Keyword(s):

Phylogenetic Trees ◽

Linear Time ◽

Sequence Similarity ◽

Similarity Measures ◽

Phylogeny Reconstruction ◽

Greedy Heuristics ◽

Biological Sequences ◽

Sequence Comparisons ◽

Multiple Sequence ◽

Alignment Free

Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
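For orientation, the exact-match average common substring idea that ACSk generalizes can be sketched in Python. This is a naive illustration with zero mismatches, not the linear-time heuristic described in the abstract and not the authors' Rust implementation. The function names are hypothetical, and the sketch assumes the usual formulation: for each position in the first sequence, take the length of the longest substring starting there that also occurs somewhere in the second sequence, then average those lengths.

```python
def longest_match_lengths(a, b):
    """For each position i in a, length of the longest prefix of a[i:]
    that occurs as a substring anywhere in b (exact matches only)."""
    lengths = []
    for i in range(len(a)):
        length = 0
        # If a longer prefix occurs in b, every shorter prefix does too,
        # so extending one character at a time finds the maximum.
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        lengths.append(length)
    return lengths


def average_common_substring(a, b):
    """Average of the per-position longest match lengths of a against b."""
    lengths = longest_match_lengths(a, b)
    return sum(lengths) / len(lengths) if lengths else 0.0
```

This naive version is far slower than suffix-tree-based approaches and is only meant to make the quantity being approximated concrete.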


FPGA Acceleration of Phylogeny Reconstruction for Whole Genome Data

2007 IEEE 7th International Symposium on BioInformatics and BioEngineering ◽

10.1109/bibe.2007.4375664 ◽

2007 ◽

Cited By ~ 8

Author(s):

Jason D. Bakos ◽

Panormitis E. Elenis ◽

Jijun Tang

Keyword(s):

Phylogeny Reconstruction ◽

Whole Genome ◽

Genome Data ◽

Fpga Acceleration
