FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

Mapping Intimacies ◽

10.1101/380824 ◽

2018 ◽

Cited By ~ 4

Author(s):

Benjamin T. James ◽

Brian B. Luczak ◽

Hani Z. Girgis

Keyword(s):

DNA Sequences ◽

Large Scale ◽

Linear Models ◽

Linear Time ◽

Pairwise Alignment ◽

Supplementary Information ◽

General Linear ◽

General Linear Models ◽

Alignment Free ◽

Identity Score

Abstract. Motivation: Pairwise alignment is a predominant algorithm in the field of bioinformatics. This algorithm is quadratic and therefore slow, especially on long sequences. Many applications utilize identity scores without the corresponding alignments. For these applications, we propose FASTCAR. It produces identity scores for pairs of DNA sequences using alignment-free methods and two self-supervised general linear models. Results: For the first time, the new tool can predict the pairwise identity score in linear time and space. On two large-scale sequence databases, FASTCAR provided the best compromise between sensitivity and precision while being faster than BLAST by 40% and faster than USEARCH by 6–10 times. Further, FASTCAR is capable of producing the pairwise identity scores of long DNA sequences (millions-of-nucleotides-long bacterial genomes); this task cannot be accomplished by any alignment-based tool. Availability: FASTCAR is available at https://github.com/TulsaBioinformaticsToolsmith/FASTCAR and as the Supplementary Dataset. Contact: [email protected]. Supplementary information: Supplementary data are available online.

Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab001 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Hani Z Girgis ◽

Benjamin T James ◽

Brian B Luczak

Keyword(s):

DNA Sequences ◽

Phylogenetic Trees ◽

Linear Models ◽

General Linear ◽

Global Alignment ◽

Optimal Alignment ◽

Pairwise Identity ◽

General Linear Models ◽

Alignment Free ◽

Alignment Algorithms

Abstract: Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms are quadratic and thus slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is to calculate the identity scores (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being faster than BLAST, Mash, MUMmer4 and USEARCH by 2–80 times. Identity was the best-performing tool when searching for low-identity matches. While constructing phylogenetic trees from about 6000 transcripts, the tree due to the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
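
To make the quantity in this abstract concrete, the following is a minimal sketch of the alignment-based identity score that Identity is said to predict, not the tool's own alignment-free method. It runs a textbook Needleman-Wunsch global alignment (quadratic in time and space, which is the cost the abstract refers to) and reports matches divided by alignment columns, gaps included. The function name `global_identity` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; they do not come from the paper.

```python
def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Identity of one optimal global alignment: matches / alignment columns.

    Scoring parameters are illustrative assumptions, not the paper's values.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):  # O(n * m) time and space
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal alignment, counting matches and columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += int(a[i - 1] == b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    # Two empty sequences have no columns; report 0.0 by convention here.
    return matches / columns if columns else 0.0


print(global_identity("ACGT", "ACGA"))  # one mismatch in four columns
print(global_identity("ACGT", "AGT"))   # one gap in four columns
```

Because the score table has (n + 1) × (m + 1) cells, this approach becomes impractical for genome-length inputs, which is the motivation the abstract gives for a linear-time predictor.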

CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

DNA Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract. Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Contact: [email protected]; [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

DNA Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract. Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability and implementation: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information: Supplementary data are available at Bioinformatics online.
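
The memory problem described in this abstract can be illustrated with a small sketch of plain k-mer frequency comparison. This is not CRAFT's learned embedding; it only shows the baseline representation the abstract says becomes too large. The helper names (`kmer_counts`, `cosine_similarity`, `dense_vector_size`) and the choice of cosine similarity as the comparison measure are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(c1[kmer] * c2[kmer] for kmer in c1.keys() & c2.keys())
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def dense_vector_size(k: int) -> int:
    """Number of entries in a dense k-mer vector over the alphabet A, C, G, T."""
    return 4 ** k


x = kmer_counts("ACGTACGTAC", 3)
y = kmer_counts("ACGTTCGTAC", 3)
print(cosine_similarity(x, y))
for k in (4, 8, 12, 16):
    print(k, dense_vector_size(k))
```

The dense vector length grows as 4^k, so every extra base in k multiplies the storage per sequence by four; this is the growth that a compact embedding is meant to avoid.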

A note on misspecification in general linear models with correlated errors for the analysis of crossover clinical trials

PLoS ONE ◽

10.1371/journal.pone.0213436 ◽

2019 ◽

Vol 14 (3) ◽

pp. e0213436 ◽

Cited By ~ 1

Author(s):

Wei Wang ◽

Ning Cong ◽

Tian Chen ◽

Hui Zhang ◽

Bo Zhang

Keyword(s):

Clinical Trials ◽

Linear Models ◽

General Linear ◽

Correlated Errors ◽

General Linear Models

Bedside assessment of residual functional activation in minimally conscious state using NIRS and general linear models

2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) ◽

10.1109/embc.2013.6610309 ◽

2013 ◽

Cited By ~ 1

Author(s):

Erika Molteni ◽

Filippo Arrigoni ◽

Alessandra Bardoni ◽

Sara Galbiati ◽

Federica Villa ◽

...

Keyword(s):

Linear Models ◽

Minimally Conscious State ◽

Conscious State ◽

General Linear ◽

General Linear Models ◽

Residual Functional ◽

Minimally Conscious

Managing heteroscedasticity in general linear models.

Psychological Methods ◽

10.1037/a0032553 ◽

2013 ◽

Vol 18 (3) ◽

pp. 335-351 ◽

Cited By ~ 29

Author(s):

Patrick J. Rosopa ◽

Meline M. Schaffer ◽

Amber N. Schroeder

Keyword(s):

Linear Models ◽

General Linear ◽

General Linear Models

Regression and General Linear Models

Using R for Statistics ◽

10.1007/978-1-4842-0139-8_11 ◽

2014 ◽

pp. 163-184

Author(s):

Sarah Stowell

Keyword(s):

Linear Models ◽

General Linear ◽

General Linear Models

General Linear Models

CRC Standard Probability and Statistics Tables and Formulae ◽

10.1201/9781420050264.ch16 ◽

1999 ◽

Keyword(s):

Linear Models ◽

General Linear ◽

General Linear Models

General Linear Models

Introduction to Bayesian Methods in Ecology and Natural Resources ◽

10.1007/978-3-030-60750-0_7 ◽

2020 ◽

pp. 131-153

Author(s):

Edwin J. Green ◽

Andrew O. Finley ◽

William E. Strawderman

Keyword(s):

Linear Models ◽

General Linear ◽

General Linear Models

Local Influence in Constrained General Linear Models

Journal of Data Science ◽

10.6339/jds.201410_12(4).0008 ◽

2021 ◽

Vol 12 (4) ◽

pp. 717-726

Author(s):

Hadi Emami ◽

Mostafa Emami

Keyword(s):

Linear Models ◽

Local Influence ◽

General Linear ◽

General Linear Models
