fastq file Latest Research Papers

Demultiplexing Nanopore reads with LAST v2

10.17504/protocols.io.b28fqhtn ◽

2021 ◽

Author(s):

David A Eccles

Keyword(s):

Fasta File ◽

Manual Method ◽

Fastq File

This protocol describes a semi-manual method for read demultiplexing, as used after my presentation "Sequencing DNA with Linux Cores and Nanopores" to work out the number of reads captured by different barcodes. Input: reads as a FASTQ file; barcode sequences as a FASTA file. Output: reads split into a single FASTQ file per target (barcode). Note: barcode/adapter sequences are not trimmed by this protocol.
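
The final splitting step described above (one FASTQ output per barcode, reads left untrimmed) might look roughly like the sketch below. It is not the protocol's own code: the protocol matches barcodes with LAST, whereas here the read-to-barcode table `assignments` is a hypothetical stand-in that is assumed to have been produced already, and the names `read_fastq` and `split_by_barcode` are invented for illustration. The sketch also assumes the simple four-lines-per-record FASTQ layout.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def split_by_barcode(handle, assignments, unassigned="unclassified"):
    """Group FASTQ records by target barcode.

    `assignments` maps read ID -> barcode name (hypothetical input, e.g.
    derived from an alignment of barcodes to reads). Reads are copied
    unchanged, so barcode/adapter sequences are not trimmed.
    """
    bins = defaultdict(list)
    for header, sequence, quality in read_fastq(handle):
        read_id = header[1:].split()[0]  # drop '@' and any description
        target = assignments.get(read_id, unassigned)
        bins[target].append(f"{header}\n{sequence}\n+\n{quality}\n")
    return {target: "".join(records) for target, records in bins.items()}


# Usage with in-memory data; each value could be written to its own file.
example = io.StringIO(
    "@r1 sample\nACGT\n+\nIIII\n"
    "@r2\nTTTT\n+\nIIII\n"
    "@r3\nGGGG\n+\nIIII\n"
)
per_barcode = split_by_barcode(example, {"r1": "BC01", "r3": "BC01"})
for target, text in sorted(per_barcode.items()):
    print(target, text.count("\n") // 4, "reads")
```

Returning text per target rather than writing files directly keeps the sketch easy to check; in practice each value would be written to a file named after its barcode.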


RENANO: a REference-based compressor for NANOpore FASTQ files

10.1101/2021.03.26.437155 ◽

2021 ◽

Author(s):

Guillermo Dufort y Alvarez ◽

Gadiel Seroussi ◽

Pablo Smircich ◽

Jose Roberto Sotelo ◽

Idoia Ochoa ◽

...

Keyword(s):

Reference Genome ◽

The Other ◽

Nanopore Sequencing ◽

Total Size ◽

Base Call ◽

Sequencing Technologies ◽

Fastq File ◽

Average Improvement ◽

Call Sequence ◽

And Storage

Nanopore sequencing technologies are rapidly gaining popularity, in part due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in less than 72 h). In order to reduce the costs of transmission and storage, efficient compression methods for this type of data are needed. Unlike short-read technologies, nanopore sequencing generates long noisy reads of variable length. In this note we introduce RENANO, a reference-based lossless FASTQ data compressor, specifically tailored to compress FASTQ files generated with nanopore sequencing technologies. RENANO builds on the recent compressor ENANO, which is currently state of the art. It focuses on improving the compression of the base call sequence portion of the FASTQ file, leaving the other parts of ENANO intact. Two novel reference-based compression algorithms are introduced, contemplating different scenarios: in the first scenario, a reference genome is available without cost to both the compressor and the decompressor; in the second, the reference genome is available only on the compressor side, and a compacted version of the reference is transmitted to the decompressor as part of the compressed file. To evaluate the proposed algorithms, we compare RENANO against ENANO on several publicly available nanopore datasets. In the first scenario considered, RENANO improves the base call sequence compression of ENANO by 40.8%, on average, over all the datasets. As for total compression (including the other parts of the FASTQ file), the average improvement is 13.1%. In the second scenario considered, the base call compression improvements of RENANO over ENANO range from 15.2% to 49.0%, depending on the coverage of the compressed dataset, while in terms of total size, the improvements range from 5.1% to 16.5%.


Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities

International Journal of Molecular Sciences ◽

10.3390/ijms19113687 ◽

2018 ◽

Vol 19 (11) ◽

pp. 3687

Author(s):

Wolfgang Kaisers ◽

Holger Schwender ◽

Heiner Schaal

Keyword(s):

Hierarchical Clustering ◽

Sequential Analysis ◽

Cell Types ◽

R Package ◽

Dermal Fibroblasts ◽

Jurkat Cells ◽

Batch Effects ◽

Tree Structures ◽

Fastq File ◽

Rnaseq Data

We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. In order to provide a simple, applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicates the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
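
As a rough illustration of the idea in this abstract, the sketch below counts DNA k-mers per sample and compares samples by their k-mer profiles. It does not use or reproduce the seqTools package: the function names are invented, the Manhattan distance on relative frequencies is an assumed choice (the abstract does not name a distance), and only the first merge of an agglomerative clustering is shown.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(reads, k):
    """Count all k-mers made up only of A, C, G and T across a sample's reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N etc.
                counts[kmer] += 1
    return counts


def profile_distance(a, b):
    """Manhattan distance between relative k-mer frequencies (assumed metric)."""
    total_a = sum(a.values()) or 1
    total_b = sum(b.values()) or 1
    return sum(abs(a[kmer] / total_a - b[kmer] / total_b)
               for kmer in set(a) | set(b))


def closest_pair(samples, k):
    """Return the two sample names that an agglomerative HC would merge first."""
    profiles = {name: kmer_counts(reads, k) for name, reads in samples.items()}
    return min(combinations(sorted(profiles), 2),
               key=lambda pair: profile_distance(profiles[pair[0]],
                                                 profiles[pair[1]]))


samples = {
    "s1": ["ACGTACGT"],
    "s2": ["ACGTACGA"],
    "s3": ["TTTTTTTT"],
}
print(closest_pair(samples, k=2))
```

In a full analysis the pairwise distances would feed a standard hierarchical clustering routine, and samples grouping by preparation batch rather than by experimental group would be the warning sign the abstract describes.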


A Compression Algorithm of Fastq File Based on Distribution Characteristics Analysis

2018 13th International Conference on Computer Science & Education (ICCSE) ◽

10.1109/iccse.2018.8468742 ◽

2018 ◽

Cited By ~ 1

Author(s):

Shengyu Lu ◽

Hanping Chen ◽

Lifa Peng ◽

Beizhan Wang ◽

Hongji Wang ◽

...

Keyword(s):

Compression Algorithm ◽

Distribution Characteristics ◽

Fastq File ◽

Characteristics Analysis


Quality Assessment of High-throughput DNA Sequencing Data via Range analysis

10.1101/101469 ◽

2017 ◽

Author(s):

M. Oğuzhan Külekci ◽

Ali Fotouhi ◽

Mina Majidi

Keyword(s):

Quality Assessment ◽

Software Tool ◽

Threshold Value ◽

Sequencing Data ◽

Range Analysis ◽

Statistical Parameters ◽

Mean Values ◽

Fastq File ◽

High Throughput Dna Sequencing ◽

Assessment Procedures

In the recent literature, a number of studies have appeared on the quality assessment of sequencing data. These efforts, to a great extent, focused on reporting the statistical parameters regarding the distribution of the quality scores and/or the base-calls in a FASTQ file. We investigate another dimension for the quality assessment, motivated by the fact that reads including long intervals with fewer errors improve the performance of the post-processing tools in the downstream analysis. Thus, the quality assessment procedures proposed in this study aim to analyze the segments on the reads that are above a certain quality. We define an interval of a read to be of desired quality when there are at most k quality scores less than or equal to a threshold value v, for some v and k provided by the user. We present the algorithm to detect those ranges and introduce new metrics computed from their lengths. These metrics include the mean values for the longest, shortest, average, cubic average, and average variation coefficient of the fragment lengths that are appropriate according to the v and k input parameters. We provide a new software tool, QASDRA, for quality assessment of sequencing data via range analysis. QASDRA, implemented in Python and publicly available at https://github.com/ali-cp/QASDRA.git, creates the quality assessment report of an input FASTQ file according to the user-specified k and v parameters. It also has the capability to filter out reads according to the metrics introduced.
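
The interval definition in this abstract (at most k quality scores less than or equal to v) can be illustrated with a sliding window. The sketch below finds only the longest such interval in one read; it is not taken from QASDRA, whose algorithm and metrics may differ, and the function names and the Phred+33 encoding are assumptions.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def longest_quality_range(scores, v, k):
    """Return (start, end) of the longest half-open interval scores[start:end]
    that contains at most k scores <= v. Returns (0, 0) for empty input."""
    best_start, best_end = 0, 0
    left = 0
    low = 0  # number of scores <= v inside the current window
    for right, score in enumerate(scores):
        if score <= v:
            low += 1
        # Shrink from the left until the window satisfies the definition again.
        while low > k:
            if scores[left] <= v:
                low -= 1
            left += 1
        if right + 1 - left > best_end - best_start:
            best_start, best_end = left, right + 1
    return best_start, best_end


scores = [30, 10, 35, 36, 5, 40, 40, 40, 2]
print(longest_quality_range(scores, v=20, k=1))  # → (2, 8)
print(longest_quality_range(scores, v=20, k=0))  # → (5, 8)
```

Each score enters and leaves the window at most once, so the scan is linear in the read length.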


LCTD: A lossless compression tool of FASTQ file based on transformation of original file distribution

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2016.7822639 ◽

2016 ◽

Author(s):

Jiabing Fu ◽

Yacong Ma ◽

Bixin Ke ◽

Shoubin Dong

Keyword(s):

Lossless Compression ◽

Fastq File ◽

File Distribution ◽

Original File


VSEARCH: a versatile open source tool for metagenomics

PeerJ ◽

10.7717/peerj.2584 ◽

2016 ◽

Vol 4 ◽

pp. e2584 ◽

Cited By ~ 2221

Author(s):

Torbjørn Rognes ◽

Tomáš Flouri ◽

Ben Nichols ◽

Christopher Quince ◽

Frédéric Mahé

Keyword(s):

Open Source ◽

Population Genomics ◽

De Novo ◽

Sequence Data ◽

Pairwise Alignment ◽

Nucleotide Sequences ◽

Global Alignment ◽

Nucleotide Sequence Data ◽

Fastq File ◽

Target Sequences

Background. VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use. Methods. When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences, a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads. Results. VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-ends read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end reads merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0. Discussion. VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.


VSEARCH: a versatile open source tool for metagenomics

10.7287/peerj.preprints.2409 ◽

2016 ◽

Cited By ~ 9

Author(s):

Torbjørn Rognes ◽

Tomáš Flouri ◽

Ben Nichols ◽

Christopher Quince ◽

Frédéric Mahé

Keyword(s):

Open Source ◽

De Novo ◽

Sequence Data ◽

Pairwise Alignment ◽

Low Complexity ◽

Nucleotide Sequences ◽

Global Alignment ◽

Nucleotide Sequence Data ◽

Fastq File ◽

Target Sequences

Background. VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing metagenomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use. Methods. When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences, a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads. Results. VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e. format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-ends read merging. 
VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end reads merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0. Discussion. VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.


VSEARCH: a versatile open source tool for metagenomics

10.7287/peerj.preprints.2409v1 ◽

2016 ◽

Cited By ~ 25

Author(s):

Torbjørn Rognes ◽

Tomáš Flouri ◽

Ben Nichols ◽

Christopher Quince ◽

Frédéric Mahé

Keyword(s):

Open Source ◽

De Novo ◽

Sequence Data ◽

Pairwise Alignment ◽

Low Complexity ◽

Nucleotide Sequences ◽

Global Alignment ◽

Nucleotide Sequence Data ◽

Fastq File ◽

Target Sequences

Background. VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing metagenomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use. Methods. When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences, a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads. Results. VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e. format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-ends read merging. 
VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end reads merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0. Discussion. VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.


Multi-Institutional FASTQ File Exchange as a Means of Proficiency Testing for Next-Generation Sequencing Bioinformatics and Variant Interpretation

Journal of Molecular Diagnostics ◽

10.1016/j.jmoldx.2016.03.002 ◽

2016 ◽

Vol 18 (4) ◽

pp. 572-579 ◽

Cited By ~ 21

Author(s):

Kurtis D. Davies ◽

Midhat S. Farooqi ◽

Mike Gruidl ◽

Charles E. Hill ◽

Julie Woolworth-Hirschhorn ◽

...

Keyword(s):

Next Generation Sequencing ◽

Proficiency Testing ◽

Variant Interpretation ◽

Next Generation ◽

Fastq File ◽

Generation Sequencing

