PgRC: pseudogenome-based read compressor

Tomasz M Kowalski; Szymon Grabowski

doi:10.1093/bioinformatics/btz919

PgRC: pseudogenome-based read compressor

Bioinformatics ◽

10.1093/bioinformatics/btz919 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2082-2089 ◽

Cited By ~ 2

Author(s):

Tomasz M Kowalski ◽

Szymon Grabowski

Keyword(s):

Compression Ratio ◽

High Throughput Sequencing ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

High Quality ◽

Sequencing Technologies ◽

Significant Interest ◽

The One ◽

Shortest Common Superstring

Abstract

Motivation: The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources.

Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15 and 20% on average, respectively, while being comparably fast in decompression.

Availability and implementation: PgRC can be downloaded from https://github.com/kowallus/PgRC.

Supplementary information: Supplementary data are available at Bioinformatics online.
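To make the idea of an approximate shortest common superstring concrete, here is a minimal sketch of the classical greedy heuristic: repeatedly merge the two reads with the longest suffix-prefix overlap. This is an illustration only. The abstract does not describe how PgRC actually builds its pseudogenome, and the function names `overlap` and `greedy_superstring` are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy approximation of the shortest common superstring (quadratic, toy-sized input only)."""
    # Remove exact duplicates, then reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""


reads = ["ACGTAC", "GTACGG", "CGGTTA"]
pseudogenome = greedy_superstring(reads)
print(pseudogenome, len(pseudogenome))
```

Once such a superstring exists, each read can in principle be stored as an offset into it rather than as raw bases, which is where the compression gain would come from in a scheme of this kind.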


PgRC: Pseudogenome based Read Compressor

10.1101/710822 ◽

2019 ◽

Author(s):

Tomasz Kowalski ◽

Szymon Grabowski

Keyword(s):

High Throughput ◽

Compression Ratio ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Quality ◽

Link Type ◽

Sequencing Technologies ◽

Significant Interest ◽

The One ◽

Shortest Common Superstring

Abstract

Motivation: The amount of sequencing data from High-Throughput Sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources.

Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression.

Availability: PgRC can be downloaded from https://github.com/kowallus/[email protected]


Engineering the Compression of Sequencing Reads

10.1101/2020.05.01.071720 ◽

2020 ◽

Author(s):

Tomasz Kowalski ◽

Szymon Grabowski

Keyword(s):

High Throughput ◽

Compression Ratio ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Current Version ◽

High Quality ◽

High Throughput Sequencing Data ◽

Compression Time ◽

Practical Performance ◽

Shortest Common Superstring

Abstract

Motivation: FASTQ remains among the widely used formats for high-throughput sequencing data. Despite advances in specialized FASTQ compressors, they are still imperfect in terms of practical performance tradeoffs.

Results: We present a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. The current version, v1.2, practically preserves the compression ratio and decompression speed of the previous one, reducing the compression time by a factor of about 4–5 on a 6-core/12-thread machine.

Availability: PgRC 1.2 can be downloaded from https://github.com/kowallus/[email protected]


Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract

Summary: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed in downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand.

Availability and implementation: https://github.com/yuansliu/minirmd.

Supplementary information: Supplementary data are available at Bioinformatics online.
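As a rough illustration of clustering reads by minimizers over several rounds, the sketch below buckets reads by their smallest canonical k-mer and, within each bucket, drops reads that match an already kept read (or its reverse complement) within a small Hamming distance. It is not minirmd's implementation. The helper names, the default k values, and the `max_mismatches` threshold are assumptions made for the example.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(read: str, k: int):
    """Lexicographically smallest canonical k-mer of `read` (None if the read is shorter than k)."""
    best = None
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        canon = min(kmer, revcomp(kmer))  # strand-independent representative
        if best is None or canon < best:
            best = canon
    return best


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; unequal lengths never count as near-duplicates."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def dedup(reads, ks=(5, 4), max_mismatches=0):
    """One clustering round per k: bucket by minimizer, keep one representative per near-duplicate group."""
    kept = list(reads)
    for k in ks:
        buckets = defaultdict(list)
        for r in kept:
            buckets[minimizer(r, k)].append(r)
        kept = []
        for group in buckets.values():
            reps = []
            for r in group:
                rc = revcomp(r)
                if not any(hamming(r, q) <= max_mismatches or
                           hamming(rc, q) <= max_mismatches for q in reps):
                    reps.append(r)
            kept.extend(reps)
    return kept


print(dedup(["ACGTTGCAAC", "ACGTTGCAAC", "GTTGCAACGT", "TTTTTGGGGG"]))
```

Using more than one k matters because a mismatch that falls inside the minimizer sends two near-duplicate reads to different buckets in that round. A different k may select a minimizer away from the mismatch.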


LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

DNA Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract

Motivation: Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists.

Results: We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve its performance.

Availability and implementation: LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module.

Supplementary information: Supplementary data are available at Bioinformatics Advances.
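The notion of counting common barcodes between two genomic regions can be illustrated with plain sets. The sketch below does not use the LRez API (which is C++ and is not described in the abstract). It assumes alignments have already been reduced to hypothetical `(chromosome, position, barcode)` tuples, and the barcode strings are made up for the example.

```python
def barcodes_in_region(alignments, chrom, start, end):
    """Set of barcodes seen on alignments starting inside [start, end) on `chrom`."""
    return {bc for c, pos, bc in alignments if c == chrom and start <= pos < end}


def common_barcodes(alignments, region_a, region_b):
    """Barcodes shared by two regions, each given as a (chrom, start, end) tuple."""
    return (barcodes_in_region(alignments, *region_a)
            & barcodes_in_region(alignments, *region_b))


# Toy input: (chromosome, alignment start, barcode)
alignments = [
    ("chr1", 100, "BX-AAAC"),
    ("chr1", 150, "BX-GGTT"),
    ("chr1", 5200, "BX-AAAC"),
    ("chr1", 5300, "BX-CCCA"),
    ("chr2", 120, "BX-GGTT"),
]
shared = common_barcodes(alignments, ("chr1", 0, 1000), ("chr1", 5000, 6000))
print(len(shared), sorted(shared))
```

A high number of shared barcodes between two distant regions suggests that reads from both came from the same long molecules, which is the signal used in applications such as scaffolding and structural variant calling.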


NetCoMi: network construction and comparison for microbiome data in R

Briefings in Bioinformatics ◽

10.1093/bib/bbaa290 ◽

2020 ◽

Author(s):

Stefanie Peschel ◽

Christian L Müller ◽

Erika von Mutius ◽

Anne-Laure Boulesteix ◽

Martin Depner

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Network Construction ◽

Microbial Association ◽

High Throughput Sequencing Data ◽

Microbial Associations ◽

Microbiome Data

Abstract

Motivation: Estimating microbial association networks from high-throughput sequencing data is a common exploratory data analysis approach aiming at understanding the complex interplay of microbial communities in their natural habitat. Statistical network estimation workflows comprise several analysis steps, including methods for zero handling, data normalization and computing microbial associations. Since microbial interactions are likely to change between conditions, e.g. between healthy individuals and patients, identifying network differences between groups is often an integral secondary analysis step. Thus far, however, no unifying computational tool is available that facilitates the whole analysis workflow of constructing, analysing and comparing microbial association networks from high-throughput sequencing data.

Results: Here, we introduce NetCoMi (Network Construction and comparison for Microbiome data), an R package that integrates existing methods for each analysis step in a single reproducible computational workflow. The package offers functionality for constructing and analysing single microbial association networks as well as quantifying network differences. This enables insights into whether single taxa, groups of taxa or the overall network structure change between groups. NetCoMi also contains functionality for constructing differential networks, thus making it possible to assess whether single pairs of taxa are differentially associated between two groups. Furthermore, NetCoMi facilitates the construction and analysis of dissimilarity networks of microbiome samples, enabling a high-level graphical summary of the heterogeneity of an entire microbiome sample collection. We illustrate NetCoMi’s wide applicability using data sets from the GABRIELA study to compare microbial associations in settled dust from children’s rooms between samples from two study centers (Ulm and Munich).

Availability: R scripts used for producing the examples shown in this manuscript are provided as supplementary data. The NetCoMi package, together with a tutorial, is available at https://github.com/stefpeschel/NetCoMi.

Contact: Tel:+49 89 3187 43258; [email protected]

Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.


SVJedi: genotyping structural variations with long reads

Bioinformatics ◽

10.1093/bioinformatics/btaa527 ◽

2020 ◽

Vol 36 (17) ◽

pp. 4568-4575

Author(s):

Lolita Lecompte ◽

Pierre Peterlongo ◽

Dominique Lavenier ◽

Claire Lemaitre

Keyword(s):

Supplementary Information ◽

Sequencing Data ◽

Structural Variations ◽

Short Read ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Clinical Diagnoses ◽

Long Read ◽

The One

Abstract

Motivation: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is no dedicated tool to assess whether known SVs are present in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies.

Results: We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long read genotyping tools and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches.

Availability and implementation: https://github.com/llecompte/SVJedi.git

Supplementary information: Supplementary data are available at Bioinformatics online.
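The last step described above, turning counts of informative alignments per allele into a genotype and an allele frequency, can be sketched with a simple binomial likelihood. This is a generic textbook model and not necessarily the one SVJedi uses. The function name `genotype` and the 5% `error_rate` default are assumptions for the example.

```python
import math


def genotype(ref_reads: int, alt_reads: int, error_rate: float = 0.05):
    """Pick the most likely diploid genotype from reads supporting each allele.

    Returns (genotype, alt_allele_frequency); ("./.", None) when no read is informative.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return "./.", None
    # Probability that a single informative read supports the alternative allele.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    # Binomial log-likelihood; the binomial coefficient is the same for all genotypes.
    loglik = {
        g: alt_reads * math.log(p) + ref_reads * math.log(1.0 - p)
        for g, p in p_alt.items()
    }
    best = max(loglik, key=loglik.get)
    return best, alt_reads / total


for counts in [(10, 0), (6, 5), (1, 9)]:
    print(counts, genotype(*counts))
```

The allele frequency returned here is simply the fraction of informative reads that support the SV allele, so it is only as reliable as the filtering that decided which alignments count as informative.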


circtools—a one-stop software solution for circular RNA research

Bioinformatics ◽

10.1093/bioinformatics/bty948 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2326-2328 ◽

Cited By ~ 13

Author(s):

Tobias Jakobi ◽

Alexey Uvarovskii ◽

Christoph Dieterich

Keyword(s):

High Throughput Sequencing ◽

Circular RNA ◽

Statistical Testing ◽

Supplementary Information ◽

Circular RNAs ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Multi Stage ◽

Sequence Reconstruction ◽

One Stop

Abstract

Motivation: Circular RNAs (circRNAs) originate through back-splicing events from linear primary transcripts, are resistant to exonucleases, are not polyadenylated and have been shown to be highly specific for cell type and developmental stage. CircRNA detection starts from high-throughput sequencing data and is a multi-stage bioinformatics process yielding sets of potential circRNA candidates that require further analyses. While a number of tools for the prediction process already exist, publicly available analysis tools for further characterization are rare. Our work provides researchers with a harmonized workflow that covers different stages of in silico circRNA analyses, from prediction to first functional insights.

Results: Here, we present circtools, a modular, Python-based framework for computational circRNA analyses. The software includes modules for circRNA detection, internal sequence reconstruction, quality checking, statistical testing, screening for enrichment of RBP binding sites, differential exon RNase R resistance and circRNA-specific primer design. circtools supports researchers with visualization options and data export into commonly used formats.

Availability and implementation: circtools is available via https://github.com/dieterich-lab/circtools and http://circ.tools under GPLv3.0.

Supplementary information: Supplementary data are available at Bioinformatics online.


Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in the genomic databases. High-throughput sequencing technologies develop rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads, so we applied other approaches as well, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.


SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

10.1101/2020.10.27.356907 ◽

2020 ◽

Author(s):

David Heller ◽

Martin Vingron

Keyword(s):

Genetic Information ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Diploid Genome ◽

Insertions And Deletions ◽

Structural Variant ◽

Sequencing Technologies ◽

Variant Detection ◽

Genome Assemblies

Abstract

Motivation: With the availability of new sequencing technologies, the generation of haplotype-resolved genome assemblies up to chromosome scale has become feasible. These assemblies capture the complete genetic information of both parental haplotypes, increase structural variant (SV) calling sensitivity and enable direct genotyping and phasing of SVs. Yet, existing SV callers are designed for haploid genome assemblies only, do not support genotyping or detect only a limited set of SV classes.

Results: We introduce our method SVIM-asm for the detection and genotyping of six common classes of SVs from haploid and diploid genome assemblies. Compared against the only other existing SV caller for diploid assemblies, DipCall, SVIM-asm detects more SV classes and reaches higher F1 scores for the detection of insertions and deletions on two recently published assemblies of the HG002 individual.

Availability and Implementation: SVIM-asm has been implemented in Python and can be easily installed via bioconda. Its source code is available at github.com/eldariont/[email protected]

Supplementary information: Supplementary data are available online.


hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Stefano Monti

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Wide Audience ◽

Popular Method ◽

Link Type ◽

High Throughput Sequencing Data ◽

One Stop ◽

Recent Version

Abstract

Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution to performing geneset enrichment for a wide audience and range of use cases.

Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR.

Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.
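Geneset enrichment is commonly assessed with a one-sided hypergeometric test on the overlap between a gene signature and a geneset. The sketch below shows that calculation in Python using only the standard library. It is not hypeR's R API, and the function name, argument names, and gene symbols in the example are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pvalue(overlap: int, signature_size: int,
                     geneset_size: int, background: int) -> float:
    """P(X >= overlap) when drawing `signature_size` genes without replacement
    from `background` genes, of which `geneset_size` belong to the geneset."""
    total = comb(background, signature_size)
    upper = min(signature_size, geneset_size)
    tail = sum(
        comb(geneset_size, i) * comb(background - geneset_size, signature_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def enrich(signature: set, genesets: dict, background: int) -> dict:
    """Enrichment p-value of `signature` against each named geneset."""
    return {
        name: hypergeom_pvalue(len(signature & genes), len(signature),
                               len(genes), background)
        for name, genes in genesets.items()
    }


signature = {"TP53", "MDM2", "CDKN1A"}
genesets = {
    "toy_p53_pathway": {"TP53", "MDM2", "BAX", "ATM"},
    "toy_unrelated": {"ACTB", "GAPDH"},
}
print(enrich(signature, genesets, background=10))
```

With many signatures and genesets, the resulting p-values would normally be corrected for multiple testing before interpretation.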
