CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

Mapping Intimacies ◽

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

AbstractMotivationRapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption.ResultsWe report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 102 – 104-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures.AvailabilityCRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract Motivation Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 102−104-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures. Availability and implementation CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

SAINT: automatic taxonomy embedding and categorization by Siamese triplet network

10.1101/2021.01.20.426920 ◽

2021 ◽

Author(s):

Yang Young Lu ◽

Yiwen Wang ◽

Fang Zhang ◽

Jiaxing Bai ◽

Ying Wang

Keyword(s):

Sequence Analysis ◽

Sequence Comparison ◽

Large Scale ◽

Sequence Data ◽

Comparison Method ◽

Supplementary Information ◽

Data Alignment ◽

Alignment Free ◽

Comparison Methods ◽

Real World Datasets

AbstractMotivationUnderstanding the phylogenetic relationship among organisms is the key in contemporary evolutionary study and sequence analysis is the workhorse towards this goal. Conventional approaches to sequence analysis are based on sequence alignment, which is neither scalable to large-scale datasets due to computational inefficiency nor adaptive to next-generation sequencing (NGS) data. Alignment-free approaches are typically used as computationally effective alternatives yet still suffering the high demand of memory consumption. One desirable sequence comparison method at large-scale requires succinctly-organized sequence data management, as well as prompt sequence retrieval given a never-before-seen sequence as query.ResultsIn this paper, we proposed a novel approach, referred to as SAINT, for efficient and accurate alignment-free sequence comparison. Compared to existing alignment-free sequence comparison methods, SAINT offers advantages in two aspects: (1) SAINT is a weakly-supervised learning method where the embedding function is learned automatically from the easily-acquired data; (2) SAINT utilizes the non-linear deep learning-based model which potentially better captures the complicated relationship among genome sequences. We have applied SAINT to real-world datasets to demonstrate its empirical utility, both qualitatively and quantitatively. Considering the extensive applicability of alignment-free sequence comparison methods, we expect SAINT to motivate a more extensive set of applications in sequence comparison at large scale.AvailabilityThe open source, Apache licensed, python-implemented code will be available upon acceptance.Supplementary informationSupplementary data are available at Bioinformatics online.

Download Full-text

Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

GigaScience ◽

10.1093/gigascience/giaa048 ◽

2020 ◽

Vol 9 (5) ◽

Cited By ~ 1

Author(s):

Morteza Hosseini ◽

Diogo Pratas ◽

Burkhard Morgenstern ◽

Armando J Pinho

Keyword(s):

Dna Sequences ◽

Large Scale ◽

High Throughput Sequencing ◽

Genetic Disorders ◽

Chromosomal Evolution ◽

Genomic Rearrangements ◽

Efficient Tool ◽

Compression Technique ◽

Alignment Free ◽

Memory Efficient

Abstract Background The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.

Download Full-text

BleTIES: Annotation of natural genome editing in ciliates using long read sequencing

10.1101/2021.05.18.444610 ◽

2021 ◽

Author(s):

Brandon K. B. Seah ◽

Estienne C. Swart

Keyword(s):

Dna Sequences ◽

Sequence Data ◽

Low Complexity ◽

Supplementary Information ◽

Neighboring Element ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Element Elimination

Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, and also the ability to detect correlations of neighboring element elimination. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation: BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Contact: [email protected] Supplementary information: Benchmarking of BleTIES with published sequence data.

Download Full-text

annonex2embl: automatic preparation of annotated DNA sequences for bulk submissions to ENA

Bioinformatics ◽

10.1093/bioinformatics/btaa209 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3841-3848

Author(s):

Michael Gruenstaeudl

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Sequence Data ◽

Complete Sequence ◽

Supplementary Information ◽

Biological Research ◽

Sequence Alignments ◽

Easy Integration ◽

Central Pillar ◽

Python Package

Abstract Motivation The submission of annotated sequence data to public sequence databases constitutes a central pillar in biological research. The surge of novel DNA sequences awaiting database submission due to the application of next-generation sequencing has increased the need for software tools that facilitate bulk submissions. This need has yet to be met with the concurrent development of tools to automate the preparatory work preceding such submissions. Results The author introduce annonex2embl, a Python package that automates the preparation of complete sequence flatfiles for large-scale sequence submissions to the European Nucleotide Archive. The tool enables the conversion of DNA sequence alignments that are co-supplied with sequence annotations and metadata to submission-ready flatfiles. Among other features, the software automatically accounts for length differences among the input sequences while maintaining correct annotations, automatically interlaces metadata to each record and displays a design suitable for easy integration into bioinformatic workflows. As proof of its utility, annonex2embl is employed in preparing a dataset of more than 1500 fungal DNA sequences for database submission. Availability and implementation annonex2embl is freely available via the Python package index at http://pypi.python.org/pypi/annonex2embl. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

10.1101/2019.12.23.887349 ◽

2019 ◽

Cited By ~ 2

Author(s):

Morteza Hosseini ◽

Diogo Pratas ◽

Burkhard Morgenstern ◽

Armando J. Pinho

Keyword(s):

Dna Sequences ◽

Large Scale ◽

High Throughput Sequencing ◽

Genetic Disorders ◽

Chromosomal Evolution ◽

Genomic Rearrangements ◽

Efficient Tool ◽

Compression Technique ◽

Alignment Free ◽

Memory Efficient

AbstractBackgroundThe development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial due to their role in chromosomal evolution, genetic disorders and cancer;ResultsWe present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between two DNA sequences. This computational solution extracts information contents of the two sequences, exploiting a data compression technique, in order for finding rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image;ConclusionsTested on several synthetic and real DNA sequences from bacteria, fungi, Aves and mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions complied with previous studies which took alignment-based approaches or performed FISH (Fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ~1 GB, which makes Smash++ feasible to run on present-day standard computers.

Download Full-text

FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

10.1101/380824 ◽

2018 ◽

Cited By ~ 4

Author(s):

Benjamin T. James ◽

Brian B. Luczak ◽

Hani Z. Girgis

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Linear Models ◽

Linear Time ◽

Pairwise Alignment ◽

Supplementary Information ◽

General Linear ◽

General Linear Models ◽

Alignment Free ◽

Identity Score

AbstractMotivationPairwise alignment is a predominant algorithm in the field of bioinformatics. This algorithm is quadratic — slow especially on long sequences. Many applications utilize identity scores without the corresponding alignments. For these applications, we propose FASTCAR. It produces identity scores for pairs of DNA sequences using alignment-free methods and two self-supervised general linear models.ResultsFor the first time, the new tool can predict the pair-wise identity score in linear time and space. On two large-scale sequence databases, FASTCAR provided the best compromise between sensitivity and precision while being faster than BLAST by 40% and faster than USEARCH by 6–10 times. Further, FASTCAR is capable of producing the pair-wise identity scores of long DNA sequences — millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any alignment-based tool.AvailabilityFASTCAR is available at https://github.com/TulsaBioinformaticsToolsmith/FASTCAR and as the Supplementary Dataset [email protected] informationSupplementary data are available online.

Download Full-text

BleTIES: Annotation of natural genome editing in ciliates using long read sequencing

Bioinformatics ◽

10.1093/bioinformatics/btab613 ◽

2021 ◽

Author(s):

Brandon K B Seah ◽

Estienne C Swart

Keyword(s):

Dna Sequences ◽

Sequence Data ◽

Low Complexity ◽

Supplementary Information ◽

Software Toolkit ◽

Assembly Strategy ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read

Abstract Summary Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are typically much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, but require a different assembly strategy. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Supplementary information Benchmarking of BleTIES with published sequence data.

Download Full-text

The number of spaced-word matches between two DNA sequences as a function of the underlying pattern weight

10.1101/527515 ◽

2019 ◽

Cited By ~ 5

Author(s):

Sophie Röhling ◽

Burkhard Morgenstern

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Word Length ◽

Common Ancestor ◽

Sequence Data ◽

High Accuracy ◽

Last Common Ancestor ◽

Statistical Stability ◽

Alignment Free ◽

Simulated Sequence

AbstractWe study the number Nk of (spaced) word matches between pairs of evolutionarily related DNA sequences depending on the word length or pattern weight k, respectively. We show that, under the Jukes-Cantor model, the number of substitutions per site that occurred since two sequences evolved from their last common ancestor, can be esti-mated from the slope of a certain function of Nk. Based on these considerations, we implemented a software program for alignment-free sequence comparison called Slope-SpaM. Test runs on simulated sequence data show that Slope-SpaM can estimate phylogenetic dis-tances with high accuracy for up to around 0.5 substitutions per po-sitions. The statistical stability of our results is improved if spaced words are used instead of contiguous k-mers. Unlike previous methods that are based on the number of (spaced) word matches, our approach can deal with sequences that share only local homologies.

Download Full-text

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, naturally much less being represented in the genomic databases. High-throughput sequencing technologies develop rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in the studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.

Download Full-text