Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Hani Z Girgis ◽  
Benjamin T James ◽  
Brian B Luczak

Abstract: Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms run in quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score (the percentage of identical nucleotides in an optimal alignment of two sequences, gaps included); there is no need to visualize how every pair of sequences is aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2–80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When phylogenetic trees were constructed from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to the trees from andi, FSWM and Mash). Identity is capable of producing pairwise identity scores for bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
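To make the identity score defined in this abstract concrete, the sketch below computes it the slow, quadratic way: it builds one optimal global alignment with the Needleman-Wunsch recurrence and divides the number of matching columns by the total number of alignment columns, gaps included. It is an illustration of the quantity that Identity predicts, not the tool's own code. The function name `global_identity` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example; different scoring schemes can yield different optimal alignments and therefore different scores.

```python
def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Identity of one optimal global alignment: matches / alignment columns.

    Scoring parameters are illustrative assumptions, not values from the paper.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 1.0  # two empty sequences are trivially identical

    # score[i][j] = best score for aligning a[:i] with b[:j] (O(n*m) time and space)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting matching columns and all columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return matches / columns


print(global_identity("ACGT", "ACGA"))  # one mismatch in four columns
print(global_identity("ACGT", "AGT"))   # one gap in four columns
```

The nested loops are what make alignment quadratic in sequence length, which is the cost the abstract says the alignment-free approach avoids.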

2018 ◽  
Author(s):  
Benjamin T. James ◽  
Brian B. Luczak ◽  
Hani Z. Girgis

Abstract: Motivation: Pairwise alignment is a predominant algorithm in the field of bioinformatics. This algorithm is quadratic and therefore slow, especially on long sequences. Many applications utilize identity scores without the corresponding alignments. For these applications, we propose FASTCAR. It produces identity scores for pairs of DNA sequences using alignment-free methods and two self-supervised general linear models. Results: For the first time, the new tool can predict the pairwise identity score in linear time and space. On two large-scale sequence databases, FASTCAR provided the best compromise between sensitivity and precision while being 40% faster than BLAST and 6–10 times faster than USEARCH. Further, FASTCAR is capable of producing the pairwise identity scores of long DNA sequences, such as bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any alignment-based tool. Availability: FASTCAR is available at https://github.com/TulsaBioinformaticsToolsmith/FASTCAR and as the Supplementary Dataset. Contact: [email protected] Supplementary information: Supplementary data are available online.
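As a rough illustration of what an alignment-free statistic can look like, the sketch below counts overlapping k-mers in each sequence in a single linear pass and compares the two histograms by their intersection. This is not FASTCAR's feature set or model; the abstract does not list the statistics it uses. The function names, the choice of k = 3 and the histogram-intersection measure are all assumptions made for the example. In a tool of this kind, statistics such as this one would serve as inputs to a trained linear model rather than being reported directly as identity scores.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Histogram of overlapping k-mers, built in one pass over the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def intersection_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared k-mer count divided by the larger total k-mer count (in [0, 1]).

    An assumed, illustrative statistic; not taken from the FASTCAR paper.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # Counter '&' keeps the minimum count per k-mer
    total = max(sum(ca.values()), sum(cb.values()))
    return shared / total if total else 0.0


print(intersection_similarity("ACGTACGT", "ACGTACGA"))
```

No alignment matrix is built here, so time and memory grow linearly with sequence length for a fixed k.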


PLoS ONE ◽  
2019 ◽  
Vol 14 (3) ◽  
pp. e0213436 ◽  
Author(s):  
Wei Wang ◽  
Ning Cong ◽  
Tian Chen ◽  
Hui Zhang ◽  
Bo Zhang

2013 ◽  
Vol 18 (3) ◽  
pp. 335-351 ◽  
Author(s):  
Patrick J. Rosopa ◽  
Meline M. Schaffer ◽  
Amber N. Schroeder

2010 ◽  
Vol 08 (02) ◽  
pp. 181-198 ◽  
Author(s):  
Rajib Sengupta ◽  
Dhundy R. Bastola ◽  
Hesham H. Ali

Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which are used in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept, the Enzyme Cut Order: a characteristic of DNA sequences that is based on a biological property and can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes for the analysis is aided by genetic algorithms. The results obtained with this approach, using internal transcribed spacer regions of rDNA from fungi as the target sequence, show that phylogenetically related organisms form a single cluster and that successful grouping of phylogenetically close or distant organisms depends on the choice of restriction enzymes used in the analysis. Additionally, a comparison of the trees obtained with this alignment-free method and with the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and the restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
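The similarity-matrix step described in this abstract can be sketched as follows, assuming each organism's Enzyme Cut Order is available as a list of enzyme names in the order their cut sites occur along the sequence. The example cut orders are hypothetical, and dividing the LCS length by the longer of the two cut orders is an assumed normalization; the abstract does not say how the authors convert LCS lengths into similarities.

```python
from typing import List, Sequence


def lcs_length(x: Sequence[str], y: Sequence[str]) -> int:
    """Length of the longest common subsequence, using two rolling DP rows."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_matrix(orders: List[Sequence[str]]) -> List[List[float]]:
    """Pairwise LCS similarity, normalized by the longer cut order (assumed)."""
    n = len(orders)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(orders[i]), len(orders[j]))
            s = lcs_length(orders[i], orders[j]) / longest if longest else 1.0
            sim[i][j] = sim[j][i] = s
    return sim


# Hypothetical cut orders for three sequences (illustration only).
orders = [
    ["EcoRI", "HindIII", "BamHI"],
    ["EcoRI", "BamHI"],
    ["BamHI", "HindIII", "EcoRI"],
]
for row in similarity_matrix(orders):
    print(row)
```

A matrix of this form can then be converted to distances and passed to a standard clustering or tree-building routine.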


Author(s):  
Edwin J. Green ◽  
Andrew O. Finley ◽  
William E. Strawderman

2021 ◽  
Vol 12 (4) ◽  
pp. 717-726
Author(s):  
Hadi Emami ◽  
Mostafa Emami
