Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression

Abstract Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on the novel construction of de novo assembled contigs. Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and group those that share the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix–prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20–80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. This performance is mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs. Availability and implementation: https://github.com/yuansliu/minicom Supplementary information: Supplementary data are available at Bioinformatics online.
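The indexing idea in this abstract can be illustrated with a minimal Python sketch. It is not the minicom implementation: the function names are hypothetical, the lexicographically smallest k-mer is assumed as the minimizer ordering, and reverse complements, contig construction and the overlap similarity are all left out. Reads are assumed to be at least k bases long.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of a read (assumed ordering)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def window_minimizers(seq, w, k):
    """Return the set of (w, k)-minimizers of a sequence.

    A (w, k)-minimizer is the smallest k-mer in a window of w consecutive k-mers.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def group_by_minimizer(reads, k):
    """Bucket reads that share the same k-minimizer."""
    groups = defaultdict(list)
    for read in reads:
        groups[minimizer(read, k)].append(read)
    return dict(groups)


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG", "TTTTGG"]
    print(group_by_minimizer(reads, 3))
    print(window_minimizers("ACGTAC", 2, 3))
```

Reads that overlap tend to share a minimizer, so each bucket is a natural candidate set for building one contig. Two contigs that share a (w, k)-minimizer are in turn candidates for a suffix–prefix overlap check.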


Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract Summary: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed by downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. Availability and implementation: https://github.com/yuansliu/minirmd. Supplementary information: Supplementary data are available at Bioinformatics online.
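A rough Python sketch of minimizer-bucketed duplicate removal follows. It is an illustration under stated assumptions, not minirmd's algorithm: the Hamming-distance criterion, the `max_mismatch` threshold, the minimizer lengths in `ks` and all function names are hypothetical. Reads are assumed to contain only A, C, G and T and to be at least as long as the largest k.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_minimizer(read, k):
    """Smallest k-mer over the read and its reverse complement."""
    strands = (read, revcomp(read))
    return min(s[i:i + k] for s in strands for i in range(len(s) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(a, b, max_mismatch):
    """Compare b to a on both strands (assumed criterion: Hamming distance)."""
    if len(a) != len(b):
        return False
    return min(hamming(a, b), hamming(a, revcomp(b))) <= max_mismatch


def dedup_round(reads, k, max_mismatch):
    """One clustering round: only reads in the same bucket are compared."""
    buckets = defaultdict(list)
    kept = []
    for read in reads:
        key = canonical_minimizer(read, k)
        if any(is_near_duplicate(read, q, max_mismatch) for q in buckets[key]):
            continue  # near-duplicate of a read already kept
        buckets[key].append(read)
        kept.append(read)
    return kept


def dedup(reads, ks=(5, 4), max_mismatch=1):
    """Repeat the round with minimizers of different lengths."""
    for k in ks:
        reads = dedup_round(reads, k, max_mismatch)
    return reads


if __name__ == "__main__":
    print(dedup(["AACCGGTA", "TACCGGTT", "AACCGGTC", "GGGGTTTT"]))
```

One plausible reason for using several lengths is that a mismatch falling inside the minimizer can send two near-duplicates to different buckets at one value of k while a different k still brings them together.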


MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract Summary: MUM&Co is a single bash script that detects structural variations (SVs) using whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de novo assemblies generated by third-generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only the necessary genomic details. Availability and implementation: https://github.com/SAMtoBAM/MUMandCo. Supplementary information: Supplementary data are available at Bioinformatics online.


RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes

International Journal of Genomics ◽

10.1155/2015/563482 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Krisztian Buza ◽

Bartek Wilczynski ◽

Norbert Dojer

Keyword(s):

Reference Genome ◽

De Novo ◽

Real Data ◽

Reference Sequence ◽

Individual Genome ◽

Single Experiment ◽

Sequencing Technologies ◽

Sequencing Cost ◽

The Individual ◽

Assembly Software

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data. We have made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.


AptCompare: optimized de novo motif discovery of RNA aptamers via HTS-SELEX

Bioinformatics ◽

10.1093/bioinformatics/btaa054 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2905-2906 ◽

Cited By ~ 1

Author(s):

Kevin R Shieh ◽

Christina Kratschmer ◽

Keith E Maier ◽

John M Greally ◽

Matthew Levy ◽

...

Keyword(s):

Motif Discovery ◽

High Throughput Sequencing ◽

De Novo ◽

Rna Aptamers ◽

Supplementary Information ◽

Good Correspondence ◽

Detection Algorithms ◽

De Novo Motif Discovery ◽

Exponential Enrichment ◽

Analytical Approaches

Abstract Summary: High-throughput sequencing can enhance the analysis of aptamer libraries generated by the Systematic Evolution of Ligands by EXponential enrichment. Robust analysis of the resulting sequenced rounds is best implemented by determining a ranked consensus of reads after processing by multiple aptamer detection algorithms. While several such approaches have been developed to this end, their installation and implementation are problematic. We developed AptCompare, a cross-platform program that combines six of the most widely used analytical approaches for the identification of RNA aptamer motifs and uses a simple weighted ranking to order the candidate aptamers, all driven within the same GUI-enabled environment. We demonstrate AptCompare’s performance by identifying the top-ranked candidate aptamers from a previously published selection experiment in our laboratory, with follow-up bench assays demonstrating good correspondence between the sequences’ rankings and their binding affinities. Availability and implementation: The source code and pre-built virtual machine images are freely available at https://bitbucket.org/shiehk/aptcompare. Supplementary information: Supplementary data are available at Bioinformatics online.
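A weighted rank consensus of the general kind mentioned above might look like the following Python sketch. The abstract does not give AptCompare's formula, so the scoring rule (weighted sum of rank positions, with unlisted candidates placed last), the tool names and the weights are all assumptions made for illustration.

```python
def weighted_rank_consensus(rankings, weights):
    """Order candidates by the weighted sum of their rank positions.

    rankings: dict mapping a tool name to its candidate list, best first.
    weights:  dict mapping a tool name to a weight (defaults to 1.0).
    A candidate missing from a tool's list gets that list's worst position
    (assumed rule). Ties are broken alphabetically.
    """
    candidates = {c for ranked in rankings.values() for c in ranked}
    scores = dict.fromkeys(candidates, 0.0)
    for tool, ranked in rankings.items():
        weight = weights.get(tool, 1.0)
        position = {c: i for i, c in enumerate(ranked)}
        worst = len(ranked)
        for c in candidates:
            scores[c] += weight * position.get(c, worst)
    return sorted(candidates, key=lambda c: (scores[c], c))


if __name__ == "__main__":
    # Hypothetical tools and candidate sequences.
    rankings = {
        "toolA": ["s1", "s2", "s3"],
        "toolB": ["s2", "s1", "s3"],
        "toolC": ["s2", "s3"],
    }
    weights = {"toolA": 2.0, "toolB": 1.0, "toolC": 1.0}
    print(weighted_rank_consensus(rankings, weights))
```

A lower total score means a better consensus rank, so a candidate ranked highly by several tools outranks one favored by a single heavily weighted tool.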


SPRING: a next-generation compressor for FASTQ data

Bioinformatics ◽

10.1093/bioinformatics/bty1015 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2674-2676 ◽

Cited By ~ 18

Author(s):

Shubham Chandak ◽

Kedar Tatwawadi ◽

Idoia Ochoa ◽

Mikel Hernaez ◽

Tsachy Weissman

Keyword(s):

High Throughput Sequencing ◽

Random Access ◽

Lossless Compression ◽

General Purpose ◽

Supplementary Information ◽

High Coverage ◽

Sequencing Technologies ◽

Long Read ◽

Previous State ◽

Computational Resources

Abstract Motivation: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information: Supplementary data are available at Bioinformatics online.


CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
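The baseline that this abstract starts from, alignment-free comparison by k-mer frequencies, can be sketched in Python as below. This is not CRAFT's learned embedding. Cosine similarity is assumed as the comparison measure and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Sparse k-mer frequency vector of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[x] * b[x] for x in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_vector_size(k):
    """Number of entries in a dense DNA k-mer frequency vector."""
    return 4 ** k


if __name__ == "__main__":
    x = kmer_counts("ACGTACGT", 3)
    y = kmer_counts("ACGTACGA", 3)
    print(cosine_similarity(x, y))
    print(dense_vector_size(12))
```

A dense vector has 4^k entries, which is about 16.8 million at k = 12. This growth is the memory and storage cost that a compact representation is meant to avoid.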


IgMAT: immunoglobulin sequence multi-species annotation tool for any species including those with incomplete antibody annotation or unusual characteristics

10.1101/2021.09.22.461368 ◽

2021 ◽

Author(s):

Daniel Dorey-Robinson ◽

Giuseppe Maccari ◽

Richard Borne ◽

John A. Hammond

Keyword(s):

High Throughput Sequencing ◽

Structural Characteristics ◽

Supplementary Information ◽

Sequencing Technologies ◽

Species Lists ◽

Repertoire Sequencing ◽

Immunoglobulin Repertoire ◽

Amino Acid Alphabet ◽

Study Species ◽

Incomplete Antibody

Abstract: The advent and continual improvement of high-throughput sequencing technologies have made immunoglobulin repertoire sequencing accessible and informative regardless of study species. However, to fully map changes in polyclonal dynamics, precise annotation of these constantly rearranging genes is pivotal. For this reason, data-agnostic tools able to learn from presented data are required. Most sequence annotation tools are designed primarily for use with human and mouse antibody sequences and use databases with fixed species lists, applying very specific assumptions that select against unique structural characteristics. We present IgMAT, which utilises a reduced amino acid alphabet, incorporates multiple HMM alignments into a single consensus and enables the incorporation of user-defined databases to better represent the species of interest. Availability and implementation: IgMAT has been developed as a Python module and is available on GitHub (https://github.com/TPI-Immunogenetics/igmat) for download under the GPLv3 license. Supplementary information: Model Breakdowns


Impact of human gene annotations on RNA-seq differential expression analysis

10.21203/rs.3.rs-301856/v1 ◽

2021 ◽

Author(s):

Yu Hamaguchi ◽

Chao Zeng ◽

Michiaki Hamada

Keyword(s):

Differential Expression ◽

High Throughput ◽

High Throughput Sequencing ◽

Human Gene ◽

Gene Annotation ◽

Differential Expression Analysis ◽

Rna Seq ◽

Gene Annotations ◽

Sequencing Technologies ◽

The Impact

Abstract Background: Differential expression (DE) analysis of RNA-seq data typically depends on gene annotations. Different sets of gene annotations are available for the human genome and are continually updated, a process complicated by the development and application of high-throughput sequencing technologies. However, the impact of the complexity of gene annotations on DE analysis remains unclear. Results: Using “mappability”, a metric of the complexity of gene annotation, we compared three distinct human gene annotations, GENCODE, RefSeq, and NONCODE, and evaluated how mappability affected DE analysis. We found that mappability was significantly different among the human gene annotations. We also found that increasing mappability improved the performance of DE analysis, and that the impact of mappability was mainly evident in the quantification step and propagated systematically downstream in the DE analysis. Conclusions: We assessed how the complexity of gene annotations affects DE analysis using mappability. Our findings indicate that the growth and complexity of gene annotations negatively impact the performance of DE analysis, suggesting that an approach that excludes unnecessary gene models from gene annotations improves the performance of DE analysis.


Customized de novo mutation detection for any variant calling pipeline: SynthDNM

Bioinformatics ◽

10.1093/bioinformatics/btab225 ◽

2021 ◽

Author(s):

Aojie Lian ◽

James Guevara ◽

Kun Xia ◽

Jonathan Sebat

Keyword(s):

Mutation Detection ◽

De Novo ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

De Novo Mutation ◽

Flexible Approach ◽

Sequencing Technologies ◽

Training Examples ◽

Simulated Training

Abstract Motivation: As sequencing technologies and analysis pipelines evolve, de novo mutation (DNM) calling tools must be adapted. Therefore, a flexible approach is needed that can accurately identify DNMs from genome or exome sequences from a variety of datasets and variant calling pipelines. Results: Here, we describe SynthDNM, a random-forest-based classifier that can be readily adapted to new sequencing or variant-calling pipelines by applying a flexible approach to constructing simulated training examples from real data. The optimized SynthDNM classifiers predict de novo SNPs and indels with robust accuracy across multiple methods of variant calling. Availability and implementation: SynthDNM is freely available on GitHub (https://github.com/james-guevara/synthdnm). Supplementary information: Supplementary data are available at Bioinformatics online.


Factorbook Motif Pipeline: A de novo motif discovery and filtering web server for ChIP-seq peaks

10.1101/033670 ◽

2015 ◽

Cited By ~ 1

Author(s):

Bong-Hyun Kim ◽

Jiali Zhuang ◽

Jie Wang ◽

Zhiping Weng

Keyword(s):

Motif Discovery ◽

High Throughput Sequencing ◽

De Novo ◽

Statistical Tests ◽

Web Server ◽

Biological Processes ◽

Web Based ◽

Sequencing Technologies ◽

De Novo Motif Discovery

Summary: High-throughput sequencing technologies such as ChIP-seq have deepened our understanding of many biological processes. De novo motif search is one of the key downstream computational analyses following ChIP-seq experiments, and several algorithms have been proposed for this purpose. However, most web-based systems do not perform independent filtering or enrichment analyses to ensure the quality of the discovered motifs. Here, we developed a web server, Factorbook Motif Pipeline, based on an algorithm used in analyzing ENCODE consortium ChIP-seq datasets. It performs comprehensive analysis on the set of peaks detected from a ChIP-seq experiment: (i) de novo motif discovery; (ii) independent composition and bias analyses and (iii) matching to the annotated motifs. The statistical tests employed in our pipeline provide a reliable measure of confidence in the significance of the motifs reported in the discovery step. Availability: Factorbook Motif Pipeline source code is accessible at the following URL: https://github.com/joshuabhk/factorbook-motif-pipeline
