Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

Abstract: Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms are quadratic and therefore slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score (the percentage of identical nucleotides in an optimal alignment, gaps included, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2–80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When phylogenetic trees were constructed from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to the trees built with andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
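To make the quantity being predicted concrete, the following minimal sketch computes an identity score from a pair of sequences that have already been aligned. It is not taken from the Identity tool: the function name `identity_score`, the `-` gap character, and the choice to divide by the full alignment length (gap columns included) are assumptions made for illustration, based on the definition given in the abstract.

```python
def identity_score(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding the same nucleotide.

    Assumption: the denominator is the full alignment length, so columns
    that contain a gap count against the score.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    # A column is a match only if both symbols agree and neither is a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical alignment: 4 matching columns out of 6.
    print(round(identity_score("ACGT-A", "AC-TTA"), 2))
```

Producing the aligned strings in the first place is the quadratic step that the abstract says Identity avoids; this sketch covers only the final counting.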


FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

10.1101/380824 ◽

2018 ◽

Cited By ~ 4

Author(s):

Benjamin T. James ◽

Brian B. Luczak ◽

Hani Z. Girgis

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Linear Models ◽

Linear Time ◽

Pairwise Alignment ◽

Supplementary Information ◽

General Linear ◽

General Linear Models ◽

Alignment Free ◽

Identity Score

Abstract. Motivation: Pairwise alignment is a predominant algorithm in the field of bioinformatics. This algorithm is quadratic and therefore slow, especially on long sequences. Many applications utilize identity scores without the corresponding alignments. For these applications, we propose FASTCAR. It produces identity scores for pairs of DNA sequences using alignment-free methods and two self-supervised general linear models. Results: For the first time, the new tool can predict the pairwise identity score in linear time and space. On two large-scale sequence databases, FASTCAR provided the best compromise between sensitivity and precision while being faster than BLAST by 40% and faster than USEARCH by 6–10 times. Further, FASTCAR is capable of producing the pairwise identity scores of long DNA sequences, such as millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any alignment-based tool. Availability: FASTCAR is available at https://github.com/TulsaBioinformaticsToolsmith/FASTCAR and as the Supplementary Dataset [email protected]. Supplementary information: Supplementary data are available online.
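As a rough illustration of what an alignment-free comparison can look like, the sketch below counts k-mers in each sequence and compares the two histograms. This is not FASTCAR's method: the abstract does not say which statistics feed its linear models, so the functions `kmer_counts` and `histogram_intersection`, the default `k=3`, and the normalization are all assumptions chosen to show a statistic that takes time linear in sequence length for a fixed k.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (one pass over the sequence)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def histogram_intersection(a: str, b: str, k: int = 3) -> float:
    """Shared k-mer mass divided by the larger histogram's total.

    Returns 1.0 for identical k-mer histograms and 0.0 when no k-mer
    is shared. Hypothetical statistic, used here only as an example.
    """
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(n, counts_b[kmer]) for kmer, n in counts_a.items())
    largest = max(sum(counts_a.values()), sum(counts_b.values()))
    return shared / largest if largest else 0.0


if __name__ == "__main__":
    print(histogram_intersection("ACGTACGT", "ACGTACGT"))
    print(histogram_intersection("AAAA", "CCCC"))
```

A statistic of this kind could serve as one input feature to a regression model that outputs a predicted identity score; how FASTCAR selects and combines its features is not described in the abstract.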


A note on misspecification in general linear models with correlated errors for the analysis of crossover clinical trials

PLoS ONE ◽

10.1371/journal.pone.0213436 ◽

2019 ◽

Vol 14 (3) ◽

pp. e0213436 ◽

Cited By ~ 1

Author(s):

Wei Wang ◽

Ning Cong ◽

Tian Chen ◽

Hui Zhang ◽

Bo Zhang

Keyword(s):

Clinical Trials ◽

Linear Models ◽

General Linear ◽

Correlated Errors ◽

General Linear Models


Bedside assessment of residual functional activation in minimally conscious state using NIRS and general linear models

2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) ◽

10.1109/embc.2013.6610309 ◽

2013 ◽

Cited By ~ 1

Author(s):

Erika Molteni ◽

Filippo Arrigoni ◽

Alessandra Bardoni ◽

Sara Galbiati ◽

Federica Villa ◽

...

Keyword(s):

Linear Models ◽

Minimally Conscious State ◽

Conscious State ◽

General Linear ◽

General Linear Models ◽

Residual Functional ◽

Minimally Conscious


Managing heteroscedasticity in general linear models.

Psychological Methods ◽

10.1037/a0032553 ◽

2013 ◽

Vol 18 (3) ◽

pp. 335-351 ◽

Cited By ~ 29

Author(s):

Patrick J. Rosopa ◽

Meline M. Schaffer ◽

Amber N. Schroeder

Keyword(s):

Linear Models ◽

General Linear ◽

General Linear Models


Regression and General Linear Models

Using R for Statistics ◽

10.1007/978-1-4842-0139-8_11 ◽

2014 ◽

pp. 163-184

Author(s):

Sarah Stowell

Keyword(s):

Linear Models ◽

General Linear ◽

General Linear Models


CLASSIFICATION AND IDENTIFICATION OF FUNGAL SEQUENCES USING CHARACTERISTIC RESTRICTION ENDONUCLEASE CUT ORDER

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720010004616 ◽

2010 ◽

Vol 08 (02) ◽

pp. 181-198 ◽

Cited By ~ 2

Author(s):

RAJIB SENGUPTA ◽

DHUNDY R. BASTOLA ◽

HESHAM H. ALI

Keyword(s):

Dna Sequences ◽

Restriction Enzymes ◽

Epidemiological Studies ◽

Global Alignment ◽

Target Sequence ◽

Alignment Algorithm ◽

Molecular Fingerprinting ◽

Data Set ◽

Alignment Free ◽

Wet Lab

Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which are used in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept of Enzyme Cut Order, a biological property-based characteristic of DNA sequences which can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes used for analysis is augmented by using genetic algorithms. The results obtained from this approach, using internal transcribed spacer regions of rDNA from fungi as the target sequence, show that phylogenetically related organisms form a single cluster and that successful grouping of phylogenetically close or distant organisms depends on the choice of restriction enzymes used in the analysis. Additionally, comparison of the trees obtained with this alignment-free method and with the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
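The LCS-based comparison described above can be sketched as follows. The abstract does not define the Enzyme Cut Order precisely, so this example assumes it is the list of enzyme names ordered by the position of their recognition sites in the sequence. The helper names, the three-enzyme panel, and the normalization by the longer cut order are likewise assumptions for illustration, not the authors' implementation.

```python
# Recognition sites of three common restriction enzymes.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def cut_order(dna: str, enzymes: dict) -> list:
    """Enzyme names ordered by where their sites occur (assumed definition)."""
    hits = []
    for name, site in enzymes.items():
        start = dna.find(site)
        while start != -1:
            hits.append((start, name))
            start = dna.find(site, start + 1)
    return [name for _, name in sorted(hits)]


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def cut_order_similarity(a: list, b: list) -> float:
    """LCS length divided by the longer cut order (hypothetical normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    # Two made-up sequences with the first two sites swapped.
    dna_1 = "GAATTC" + "AAAA" + "GGATCC" + "TTTT" + "AAGCTT"
    dna_2 = "GGATCC" + "AAAA" + "GAATTC" + "TTTT" + "AAGCTT"
    order_1, order_2 = cut_order(dna_1, ENZYMES), cut_order(dna_2, ENZYMES)
    print(order_1, order_2, cut_order_similarity(order_1, order_2))
```

Computing this value for every pair of sequences would fill the similarity matrix mentioned in the abstract. The LCS here runs over short lists of enzyme names rather than over nucleotides, which is why no sequence alignment is involved.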
