A new alignment free genome comparison algorithm based on statistically estimated feature frequency profile

Author(s):  
Hyein Seo ◽  
Dong-Ho Cho
2009 ◽  
Vol 106 (8) ◽  
pp. 2677-2682 ◽  
Author(s):  
Gregory E. Sims ◽  
Se-Ran Jun ◽  
Guohong A. Wu ◽  
Sung-Hou Kim

Author(s):  
Yuanning Li ◽  
Kyle T. David ◽  
Xing-Xing Shen ◽  
Jacob L. Steenwyk ◽  
Kenneth M. Halanych ◽  
...  

Choi and Kim (PNAS, 117: 3678-3686; first published February 4, 2020; https://doi.org/10.1073/pnas.1915766117) used the alignment-free Feature Frequency Profile (FFP) method to reconstruct a broad sketch of the tree of life based on proteome data from 4,023 taxa. The FFP-based reconstruction reports many relationships that strongly contradict the current consensus view of the tree of life, and its accuracy has not been tested. Comparison of FFP with current standard approaches, such as concatenation and coalescence, using simulation analyses shows that FFP performs poorly. We conclude that the phylogeny of the tree of life reconstructed by Choi and Kim is suspect based on methodology as well as prior phylogenetic evidence.


BMC Genomics ◽  
2018 ◽  
Vol 19 (1) ◽  
Author(s):  
Kujin Tang ◽  
Jie Ren ◽  
Richard Cronn ◽  
David L. Erickson ◽  
Brook G. Milligan ◽  
...  

2020 ◽  
Vol 117 (7) ◽  
pp. 3678-3686 ◽  
Author(s):  
JaeJin Choi ◽  
Sung-Hou Kim

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of whole genome, of whole transcriptome, and of whole proteome. Of the three, whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying information theory-based feature frequency profile method, an “alignment-free” method, gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences on grouping and kinship among the groups at deep evolutionary levels. The ToL reveals 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; 3) all of the founders of the groups have emerged in a “deep burst” at the very beginning period near the root of the ToL—an explosive birth of life’s diversity.
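The abstract above names the feature frequency profile method without describing its mechanics. As a rough illustration only, the sketch below builds a profile from overlapping k-mers ("features") and compares two profiles with the Jensen-Shannon divergence. The function names `ffp` and `js_divergence`, the toy sequences, and the choice k = 2 are assumptions made for this example; the abstract does not specify the feature length or the distance measure used in the published pipeline.

```python
from collections import Counter
from math import log2


def ffp(sequence, k):
    """Return the relative frequency of every overlapping k-mer in `sequence`.

    Hypothetical helper: the profile is stored sparsely as {k-mer: frequency}.
    """
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the feature length k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two sparse profiles.

    Assumed distance for this sketch; it is 0 for identical profiles and
    1 for profiles that share no features.
    """
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mean = 0.5 * (pf + qf)
        if pf > 0:
            divergence += 0.5 * pf * log2(pf / mean)
        if qf > 0:
            divergence += 0.5 * qf * log2(qf / mean)
    return divergence


# Toy protein fragments (made up for illustration, not real proteomes).
profile_a = ffp("MKVLAAGIVGLMKVLA", k=2)
profile_b = ffp("MKVLSAGIVALMKVLS", k=2)
distance = js_divergence(profile_a, profile_b)
```

Pairwise distances computed this way for every pair of organisms would form a distance matrix, which a distance-based tree-building method could then turn into a tree. That downstream step is outside this sketch.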


2020 ◽  
Vol 117 (50) ◽  
pp. 31580-31581 ◽  
Author(s):  
Yuanning Li ◽  
Kyle T. David ◽  
Xing-Xing Shen ◽  
Jacob L. Steenwyk ◽  
Kenneth M. Halanych ◽  
...  

2020 ◽  
Vol 21 (11) ◽  
pp. 3859
Author(s):  
Lily He ◽  
Rui Dong ◽  
Rong Lucy He ◽  
Stephen S.-T. Yau

Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
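The abstract above does not define the six positional-correlation features, so they are not reconstructed here. As a hedged illustration of the general idea of mapping a DNA sequence to a fixed-length numerical vector, the sketch below computes a 12-dimensional natural-vector-style descriptor: for each nucleotide, its count, its mean position, and a normalized second central moment of its positions. The function name `natural_vector`, the 1-based position convention, and the normalization by count times sequence length are assumptions of this example, not details taken from the abstract.

```python
def natural_vector(sequence, alphabet="ACGT"):
    """Return [counts..., mean positions..., normalized second moments...].

    Illustrative sketch: positions are 1-based, and a nucleotide that never
    occurs contributes 0.0 for its mean position and its moment.
    """
    n = len(sequence)
    positions = {base: [] for base in alphabet}
    for index, base in enumerate(sequence.upper(), start=1):
        if base in positions:  # characters outside the alphabet are skipped
            positions[base].append(index)

    counts, means, moments = [], [], []
    for base in alphabet:
        pos = positions[base]
        count = len(pos)
        mean = sum(pos) / count if count else 0.0
        # Second central moment, scaled by (count * sequence length).
        moment = sum((p - mean) ** 2 for p in pos) / (count * n) if count else 0.0
        counts.append(float(count))
        means.append(mean)
        moments.append(moment)
    return counts + means + moments


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Toy sequences, made up for illustration.
vec_a = natural_vector("ACGTACGTAC")
vec_b = natural_vector("ACGTTTGTAC")
dist = euclidean(vec_a, vec_b)
```

Because every sequence maps to a vector of the same length regardless of its own length, pairwise distances can be computed without alignment, which is the property the abstract relies on for rapid genome comparison.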

