SPRING: a next-generation compressor for FASTQ data

Shubham Chandak; Kedar Tatwawadi; Idoia Ochoa; Mikel Hernaez; Tsachy Weissman

doi:10.1093/bioinformatics/bty1015

Bioinformatics ◽

10.1093/bioinformatics/bty1015 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2674-2676 ◽

Cited By ~ 18

Author(s):

Shubham Chandak ◽

Kedar Tatwawadi ◽

Idoia Ochoa ◽

Mikel Hernaez ◽

Tsachy Weissman

Keyword(s):

High Throughput Sequencing ◽

Random Access ◽

Lossless Compression ◽

General Purpose ◽

Supplementary Information ◽

High Coverage ◽

Sequencing Technologies ◽

Long Read ◽

Previous State ◽

Computational Resources

Abstract. Motivation: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors cannot fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, it compresses 195 GB of 25× whole-genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information: Supplementary data are available at Bioinformatics online.
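The abstract describes FASTQ data as three kinds of content: reads, quality values and read identifiers. The following is a minimal sketch, assuming the common four-line FASTQ record layout, of how those three streams can be separated so that each could be modelled on its own. The function name split_fastq_streams is hypothetical, and the sketch is not SPRING's implementation.

```python
from typing import Iterable, List, Tuple


def split_fastq_streams(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split four-line FASTQ records into identifier, read and quality streams.

    Assumption: every record is exactly four lines (header, sequence,
    separator, qualities). Multi-line FASTQ is not handled.
    """
    ids: List[str] = []
    reads: List[str] = []
    quals: List[str] = []
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Ignore blank lines between records, but not inside a record.
        if not line and not record:
            continue
        record.append(line)
        if len(record) == 4:
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError("malformed FASTQ record")
            if len(seq) != len(qual):
                raise ValueError("sequence and quality lengths differ")
            ids.append(header[1:])  # drop the leading '@'
            reads.append(seq)
            quals.append(qual)
            record = []
    if record:
        raise ValueError("truncated FASTQ record")
    return ids, reads, quals
```

Keeping the streams apart is a general design choice in specialised FASTQ tools, because identifiers, bases and quality values have very different statistics.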

Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract. Summary: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed in downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. Availability and implementation: https://github.com/yuansliu/minirmd. Supplementary information: Supplementary data are available at Bioinformatics online.
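To make the idea of clustering by minimizer concrete, here is a minimal sketch that assumes a minimizer is the lexicographically smallest k-mer over a read and its reverse complement, and that reads sharing a minimizer fall into the same candidate bucket. The names minimizer and bucket_by_minimizer are hypothetical, and this is an illustration of the general technique rather than minirmd's algorithm.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(seq: str, k: int) -> str:
    """Smallest k-mer (lexicographic order) over both strands of seq."""
    if k <= 0 or len(seq) < k:
        raise ValueError("k must be positive and no longer than the read")
    candidates: List[str] = []
    for strand in (seq, reverse_complement(seq)):
        candidates.extend(strand[i:i + k] for i in range(len(strand) - k + 1))
    return min(candidates)


def bucket_by_minimizer(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Group reads by minimizer; each bucket holds duplicate candidates."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)
```

Because both strands are scanned, a read and its reverse complement receive the same minimizer and land in the same bucket. Repeating the bucketing with a different k, as the abstract describes, would give reads that were separated in one round another chance to meet.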

MIRUReader: MIRU-VNTR typing directly from long sequencing reads

Bioinformatics ◽

10.1093/bioinformatics/btz771 ◽

2019 ◽

Cited By ~ 1

Author(s):

Cheng Yee Tang ◽

Rick Twee-Hee Ong

Keyword(s):

High Throughput Sequencing ◽

Variable Number Tandem Repeat ◽

Epidemiological Studies ◽

Variable Number ◽

Supplementary Information ◽

Vntr Locus ◽

Tuberculosis Transmission ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read

Abstract. Summary: Mycobacterial interspersed repetitive unit-variable number tandem repeat (MIRU-VNTR) typing is widely used to genotype Mycobacterium tuberculosis complex in epidemiological studies for tracking tuberculosis transmission. Recent long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies can produce reads that are long enough to cover the entire repeat region in each MIRU-VNTR locus, which was previously not possible using the short reads from Illumina high-throughput sequencing technologies. We thus developed MIRUReader for MIRU-VNTR typing directly from long sequence reads. Availability and implementation: Source code and documentation for the MIRUReader program are freely available at https://github.com/phglab/MIRUReader. Supplementary information: Supplementary data are available at Bioinformatics online.

ChIPWig: A Random Access-Enabling Lossless and Lossy Compression Method for ChIP-seq Data

10.1101/127464 ◽

2017 ◽

Author(s):

Vida Ravanmehr ◽

Minji Kim ◽

Zhiying Wang ◽

Olgica Milenković

Keyword(s):

Motif Discovery ◽

Rapid Development ◽

Random Access ◽

Downstream Processing ◽

Lossy Compression ◽

General Purpose ◽

Supplementary Information ◽

Peak Calling ◽

Sequencing Technologies ◽

Lossless And Lossy Compression

Abstract. Motivation: The past decade has witnessed a rapid development of data acquisition technologies that enable integrative genomic and proteomic analysis. One such technology is chromatin immunoprecipitation sequencing (ChIP-seq), developed for analyzing interactions between proteins and DNA via next-generation sequencing technologies. As ChIP-seq experiments are inexpensive and time-efficient, massive datasets from this domain have been acquired, introducing significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a state-of-the-art lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. Wig is a standard file format, which in this setting contains relevant read density information crucial for visualization and downstream processing. ChIPWig may be executed in two different modes: lossless and lossy. Lossless ChIPWig compression allows for random access and fast queries in the file through careful variable-length block-wise encoding. ChIPWig also stores the summary statistics of each block needed for guided access. Lossy ChIPWig, in contrast, quantizes the read density values before feeding them into the lossless ChIPWig compressor. Nonuniform lossy quantization leads to further reductions in the file size, while maintaining the same accuracy of the ChIP-seq peak calling and motif discovery pipeline based on the NarrowPeaks method tailor-made for Wig files. The compressors are designed using new statistical modeling approaches coupled with delta and arithmetic encoding. Results: We tested the ChIPWig compressor on a number of ChIP-seq datasets generated by the ENCODE project. Lossless ChIPWig reduces the file sizes to merely 6% of the original, and offers an average 6-fold compression rate improvement compared to bigWig. The running times for compression and decompression are comparable to those of bigWig. The compression and decompression speeds are on the order of 0.2 MB/s using general-purpose computers. ChIPWig with random access only slightly degrades the performance and running time when compared to the standard mode. In the lossy mode, the average file sizes are reduced 2-fold compared to the lossless mode. Most importantly, near-optimal nonuniform quantization with respect to mean-square distortion does not affect peak calling and motif discovery results on the data tested. Availability and implementation: Source code and binaries are freely available for download at https://github.com/vidarmehr/[email protected] Supplementary information: Available on bioRxiv.
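The abstract mentions delta encoding and quantization of read density values. As a rough illustration only, the sketch below shows a lossless delta round trip and a simple uniform quantizer. Both the function names and the uniform step are assumptions made for this example: ChIPWig itself is described as using nonuniform quantization together with arithmetic coding, neither of which is reproduced here.

```python
from typing import List, Sequence


def delta_encode(values: Sequence[int]) -> List[int]:
    """Keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: Sequence[int]) -> List[int]:
    """Invert delta_encode by accumulating the differences."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def quantize_uniform(values: Sequence[float], step: float) -> List[float]:
    """Snap each value to the nearest multiple of step (lossy).

    A uniform grid is used purely for illustration; it stands in for the
    nonuniform quantizer mentioned in the abstract.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return [step * round(v / step) for v in values]
```

Read density tracks tend to change slowly between neighbouring positions, so the deltas cluster around zero and are cheaper for an entropy coder to represent than the raw values.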

Genome Resources for the Ex-type of Phytophthora citricola, and well-authenticated isolates of P. hibernalis, P. nicotianae and P. syringae

Phytopathology ◽

10.1094/phyto-04-21-0167-a ◽

2021 ◽

Author(s):

Subodh K. Srivastava ◽

Leandra M. Knight ◽

Mark K. Nakhla ◽

Z. Gloria Abad

Keyword(s):

Plant Pathogens ◽

High Throughput Sequencing ◽

Economic Losses ◽

Biological Research ◽

Type Specimens ◽

High Coverage ◽

Oxford Nanopore ◽

Long Read ◽

Basic And Applied Research ◽

Diagnostic Applications

Phytophthora is one of the most important genera of plant pathogens, with many members causing high economic losses worldwide. To build robust molecular identification systems, it is very important to have information from well-authenticated specimens, preferably the ex-type specimens. The reference genomes of well-authenticated specimens form a critical foundation for genetics, biological research, and diagnostic applications. In this study, we describe four draft Phytophthora genome resources for the Ex-type of P. citricola BL34 (P0716 WPC) (118 contigs for 50 Mb), and well-authenticated specimens of P. syringae BL57G (P10330 WPC) (591 contigs for 75 Mb), P. hibernalis BL41G (P3822 WPC) (404 contigs for 84 Mb), and P. nicotianae BL162 (P6303 WPC) (3984 contigs for 108 Mb), generated with MinION long-read High-Throughput Sequencing (HTS) technology (Oxford Nanopore Technologies, ONT). Using the quality reads, we assembled high-coverage genomes of P. citricola with 291X coverage and 16,662 annotated genes; P. nicotianae with 205X coverage and 29,271 annotated genes; P. syringae with 76X coverage and 23,331 annotated genes; and P. hibernalis with 42X coverage and 21,762 annotated genes. With the availability of these genome sequences and their annotations, we predict that these draft genomes will be useful for various basic and applied research, including diagnostics to protect global agriculture.

Improving the Chromosome-Level Genome Assembly of the Siamese Fighting Fish (Betta splendens) in a University Master’s Course

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401205 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2179-2183 ◽

Cited By ~ 1

Author(s):

Stefan Prost ◽

Malte Petersen ◽

Martin Grethlein ◽

Sarah Joy Hahn ◽

Nina Kuschik-Maczollek ◽

...

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Siamese Fighting Fish ◽

Betta Splendens ◽

High Quality ◽

Sequencing Platform ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Chromosome Level

Ever-decreasing costs, along with advances in sequencing and library preparation technologies, enable even small research groups to generate chromosome-level assemblies today. Here we report the generation of an improved chromosome-level assembly for the Siamese fighting fish (Betta splendens) that was carried out during a practical university master’s course. The Siamese fighting fish is a popular aquarium fish and an emerging model species for research on aggressive behavior. We updated the current genome assembly by generating a new long-read nanopore-based assembly with subsequent scaffolding to chromosome level using previously published Hi-C data. The use of ∼35× nanopore-based long-read data sequenced on a MinION platform (Oxford Nanopore Technologies) allowed us to generate a baseline assembly of only 1,276 contigs with a contig N50 of 2.1 Mbp and a total length of 441 Mbp. Scaffolding using the Hi-C data resulted in 109 scaffolds with a scaffold N50 of 20.7 Mbp. More than 99% of the assembly is contained in 21 scaffolds. The assembly showed the presence of 96.1% complete BUSCO genes from the Actinopterygii dataset, indicating a high-quality assembly. We present an improved full chromosome-level assembly of the Siamese fighting fish generated during a university master’s course. The use of ∼35× long-read nanopore data drastically improved the baseline assembly in terms of continuity. We show that relatively inexpensive high-throughput sequencing technologies such as the long-read MinION sequencing platform can be used in educational settings, allowing students to gain practical skills in modern genomics and generate high-quality results that benefit downstream research projects.
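The contig and scaffold N50 values quoted above follow a standard definition: the length L such that sequences of length at least L together cover at least half of the total assembly length. A minimal sketch of that calculation is given below; the function name n50 and the toy lengths in the usage note are illustrative and are not taken from the paper.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that pushes coverage to half or more.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For example, for toy lengths 2 through 10 the total is 54, and the three longest sequences (10, 9 and 8) already sum to 27, so n50 returns 8.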

CRiSP: accurate structure prediction of disulfide-rich peptides with cystine-specific sequence alignment and machine learning

Bioinformatics ◽

10.1093/bioinformatics/btaa193 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3385-3392

Author(s):

Zi-Lin Liu ◽

Jing-Hao Hu ◽

Fan Jiang ◽

Yun-Dong Wu

Keyword(s):

Machine Learning ◽

Sequence Alignment ◽

Structure Prediction ◽

High Throughput Sequencing ◽

Prediction Method ◽

General Purpose ◽

Supplementary Information ◽

Model Quality ◽

Specific Sequence ◽

Structure Information

Abstract. Motivation: High-throughput sequencing discovers many naturally occurring disulfide-rich peptides or cystine-rich peptides (CRPs) with diversified bioactivities. However, their structure information, which is very important to peptide drug discovery, is still very limited. Results: We have developed a CRP-specific structure prediction method called Cystine-Rich peptide Structure Prediction (CRiSP), based on a customized template database with cystine-specific sequence alignment and three machine-learning predictors. The modeling accuracy is significantly better than that of several popular general-purpose structure modeling methods, and CRiSP can provide useful model quality estimations. Availability and implementation: The CRiSP server is freely available at http://wulab.com.cn/CRISP. Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.

CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract. Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
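The storage problem described in the abstract comes from the size of a dense k-mer frequency vector, which has 4^k entries for the DNA alphabet. The sketch below builds such a vector for a short sequence. It assumes input restricted to the letters A, C, G and T, the function name kmer_frequency_vector is hypothetical, and it illustrates only the baseline representation, not CRAFT's learned embedding.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequency_vector(seq: str, k: int) -> List[float]:
    """Dense vector of k-mer relative frequencies, ordered AAA.., AAC.., etc.

    Assumption: seq contains only the letters A, C, G and T.
    """
    if k <= 0 or len(seq) < k:
        raise ValueError("k must be positive and no longer than the sequence")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # One slot per possible k-mer, so the vector has 4**k entries.
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]
```

The vector length grows as 4**k, so k = 12 already means 16,777,216 entries per sequence, which is the kind of memory cost that motivates a compact embedding.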

IgMAT: immunoglobulin sequence multi-species annotation tool for any species including those with incomplete antibody annotation or unusual characteristics

10.1101/2021.09.22.461368 ◽

2021 ◽

Author(s):

Daniel Dorey-Robinson ◽

Giuseppe Maccari ◽

Richard Borne ◽

John A. Hammond

Keyword(s):

High Throughput Sequencing ◽

Structural Characteristics ◽

Supplementary Information ◽

Sequencing Technologies ◽

Species Lists ◽

Repertoire Sequencing ◽

Immunoglobulin Repertoire ◽

Amino Acid Alphabet ◽

Study Species ◽

Incomplete Antibody

Abstract. The advent and continual improvement of high-throughput sequencing technologies have made immunoglobulin repertoire sequencing accessible and informative regardless of study species. However, to fully map changes in polyclonal dynamics, precise annotation of these constantly rearranging genes is pivotal. For this reason, data-agnostic tools able to learn from presented data are required. Most sequence annotation tools are designed primarily for use with human and mouse antibody sequences; they use databases with fixed species lists and apply very specific assumptions that select against unique structural characteristics. We present IgMAT, which utilises a reduced amino acid alphabet, incorporates multiple HMM alignments into a single consensus and enables the incorporation of user-defined databases to better represent the user's species of interest. Availability and implementation: IgMAT has been developed as a Python module, and is available on GitHub (https://github.com/TPI-Immunogenetics/igmat) for download under the GPLv3 license. Supplementary information: Model Breakdowns

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v2 ◽

2021 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract. Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short-read sequencing technologies with highly uneven read coverage, indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the short-read-based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand the causes of such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverage and found that high-coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high-coverage regions were misassembled, we examined a recently available tomato long-read-based assembly and found that 27.8% (1.41 Mb) of high-coverage regions were potentially misassembled duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high-coverage regions, we found that misassembled high-coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights into the causes of variable-coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v1 ◽

2020 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract. Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short-read sequencing technologies with highly uneven read coverage, indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the short-read-based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand the causes of such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverage and found that high-coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high-coverage regions were misassembled, we examined a recently available tomato long-read-based assembly and found that 27.8% (1.41 Mb) of high-coverage regions were potentially misassembled duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high-coverage regions, we found that misassembled high-coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights into the causes of variable-coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.
