Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis

2013 ◽  
Vol 15 (6) ◽  
pp. 890-905 ◽  
Author(s):  
O. Bonham-Carter ◽  
J. Steele ◽  
D. Bastola
Genes ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 197
Author(s):  
Ernesto Borrayo ◽  
Isaias May-Canche ◽  
Omar Paredes ◽  
J. Alejandro Morales ◽  
Rebeca Romo-Vázquez ◽  
...  

Alignment-free k-mer-based algorithms in whole genome sequence comparisons remain an ongoing challenge. Here, we explore the possibility of using topic modeling for organism whole-genome comparisons. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was considered as a document and 13-mer nucleotide representations as words. Latent Dirichlet allocation was used as the probabilistic model of the corpus. We were able to identify the topic distribution among the analyzed genomes, which is highly consistent with traditional hierarchical classification. Topic modeling may therefore be applicable to establishing relationships between genome composition and biological phenomena.
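The "genome as document, k-mer as word" representation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `kmer_counts` and `document_term_matrix` are hypothetical, only the forward strand is counted, and the topic model itself is not fitted here. The resulting count matrix is the kind of input a latent Dirichlet allocation implementation would take.

```python
from collections import Counter


def kmer_counts(genome, k=13):
    """Count every overlapping k-mer ("word") in one genome ("document")."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def document_term_matrix(genomes, k=13):
    """Build a shared vocabulary and one count row per genome.

    `genomes` maps a genome name to its sequence string (an assumed input
    format chosen for this sketch).
    """
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    # The vocabulary is the sorted union of all k-mers seen in any genome.
    vocabulary = sorted(set().union(*counts.values()))
    # A Counter returns 0 for k-mers that a genome does not contain.
    matrix = {name: [c[word] for word in vocabulary] for name, c in counts.items()}
    return vocabulary, matrix


# Toy usage with short sequences and k=2 instead of 13.
vocabulary, matrix = document_term_matrix({"a": "AAAC", "b": "AACC"}, k=2)
```

With k=13 the vocabulary can reach 4^13 distinct words, so a real analysis would likely need a sparse matrix rather than the dense lists used here.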


2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Sriram P. Chockalingam ◽  
Jodh Pannu ◽  
Sahar Hooshmand ◽  
Sharma V. Thankachan ◽  
Srinivas Aluru

Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values than previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
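To make the quantity being approximated concrete, the sketch below computes the average common substring length with up to k mismatches by brute force. It is an illustration only, assuming the usual definition (for each position in one sequence, the longest match starting there against any position in the other sequence, averaged over positions). It is neither the exact O(n log^k n) algorithm nor the linear-time heuristic from the paper, and the function names are hypothetical.

```python
def longest_match(x, i, y, j, k):
    """Length of the longest common prefix of x[i:] and y[j:] allowing at most k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length


def acs_k(x, y, k=0):
    """Average, over positions i of x, of the longest k-mismatch match of x[i:] in y.

    Brute force over all pairs of start positions, so only suitable for toy
    inputs. Both sequences are assumed to be non-empty.
    """
    total = 0
    for i in range(len(x)):
        total += max(longest_match(x, i, y, j, k) for j in range(len(y)))
    return total / len(x)


# Toy usage: k=0 gives the plain ACS value, k=1 allows one mismatch per match.
exact_similarity = acs_k("ACGTACGT", "ACGTTCGT", k=0)
one_mismatch_similarity = acs_k("ACGTACGT", "ACGTTCGT", k=1)
```

Turning this average into a phylogenetic distance requires a further normalization and symmetrization step, which is omitted here.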


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Camilla Reginatto De Pierri ◽  
Ricardo Voyceik ◽  
Letícia Graziela Costa Santos de Mattos ◽  
Mariane Gonçalves Kulik ◽  
Josué Oliveira Camargo ◽  
...  

Vectorial and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal basis. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods, and SWeeP was 10 to 100 times faster than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
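The two steps named in this abstract, indexing spaced words into a high-dimensional vector and projecting that vector onto a small random orthonormal basis, can be sketched as below. This is a toy illustration under stated assumptions, not the SWeeP tool: it uses a four-letter DNA alphabet, the mask "1101", occurrence counts as vector entries, and a Gram-Schmidt construction for the basis. All function names are hypothetical.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet; other characters raise ValueError


def spaced_word_vector(seq, mask="1101"):
    """Count spaced words: in each window, keep only the positions where mask has '1'."""
    keep = [pos for pos, bit in enumerate(mask) if bit == "1"]
    vec = [0.0] * (len(ALPHABET) ** len(keep))
    for start in range(len(seq) - len(mask) + 1):
        index = 0
        for pos in keep:
            # Read the kept characters as digits of a base-|ALPHABET| number.
            index = index * len(ALPHABET) + ALPHABET.index(seq[start + pos])
        vec[index] += 1.0
    return vec


def random_orthonormal_rows(n_rows, dim, seed=0):
    """Draw Gaussian vectors and orthonormalize them with Gram-Schmidt."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-9:  # skip the rare degenerate draw
            rows.append([a / norm for a in v])
    return rows


def project(vec, basis):
    """Coordinates of vec along each orthonormal basis vector."""
    return [sum(a * b for a, b in zip(vec, row)) for row in basis]


# Toy usage: a 64-dimensional spaced-word vector reduced to 8 coordinates.
high_dim = spaced_word_vector("ACGTACGT")
basis = random_orthonormal_rows(8, len(high_dim))
low_dim = project(high_dim, basis)
```

Because every sequence is projected with the same basis (fixed here by the seed), the resulting short vectors remain comparable to one another, which is the property the abstract emphasizes.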


2018 ◽  
Author(s):  
Jaime Derringer

Although correlations between personality and health are consistently observed, often the causal pathway, or even the direction of effect, is unknown. Genes provide an additional node of information which may be included to help clarify the relationship between personality and health. Genetically informative studies, whether focused on family-identified relationships or specific genotypes, provide clear benefits to disentangling causal processes. Genetic measures approach near universal reliability and validity: processes of inheritance are consistent across cultures, geography, and time, such that similar models and instruments may be applied to incredibly diverse populations. Although frequency and intercorrelations differ by ancestry background (Novembre et al., 2008) and cultural context (Tucker-Drob & Bates, 2016) may exert powerful moderating effects, fundamental form and function are consistent across all members of our species, and even many other species. Genetic sequence information is also highly temporally stable and possesses temporal precedence. That is, the literal genetic sequence is lifetime-stable and comes before all other experiences. Human behavior genetic research, like most personality research, faces limitations in terms of the causal inferences that may be made in the absence of experimental manipulation. But behavior genetics takes advantage of natural experiments, namely populations that differ in terms of genetic similarity (either inferred, such as twins, or measured, such as by genotyping methods), to begin to unravel the complex influences on individual differences in personality and health outcomes.

