deSPI: efficient classification of metagenomic reads with lightweight de Bruijn graph-based reference indexing

Abstract

Summary: In metagenomic studies, fast and effective tools are in wide demand for the taxonomic classification of up to billions of reads. Herein, we propose deSPI, a novel read classification method that classifies reads by recognizing and analyzing the matches between reads and references using lightweight de Bruijn graph-based reference indexing. deSPI runs faster with a relatively small memory footprint, while also achieving higher or similar sensitivity and accuracy.

Availability: The C++ source code of deSPI is available at https://github.com/hitbc/[email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Faucet: streaming de novo assembly graph construction

10.1101/125658 ◽

2017 ◽

Author(s):

Roye Rozov ◽

Gil Goldshlager ◽

Eran Halperin ◽

Ron Shamir

Keyword(s):

Resource Use ◽

De Novo ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Assembly Quality ◽

Metagenome Assembly ◽

Streaming Algorithm ◽

Supplementary Material ◽

De Bruijn

Abstract

Motivation: We present Faucet, a two-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage increases.

Results: Faucet pairs the de Bruijn graph obtained from the reads with additional metadata derived from them. We show that these metadata (coverage counts collected at junction k-mers and connections bridging between junction pairs) contain the most salient information needed for assembly, and demonstrate that they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Faucet’s resource use and assembly quality to state-of-the-art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched the performance of other assemblers optimizing resource efficiency, namely Minia and LightAssembler. However, on the metagenomes tested, Faucet’s outputs had 14-110% higher mean NGA50 lengths compared to Minia, and 2- to 11-fold higher mean NGA50 lengths compared to LightAssembler, the only other streaming assembler available.

Availability: Faucet is available at https://github.com/Shamir-Lab/[email protected],[email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
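To make the notion of a junction k-mer concrete, the following is a minimal in-memory sketch, not Faucet's streaming implementation. It assumes that a junction is a k-mer with more than one successor or predecessor in the de Bruijn graph, ignores reverse complements, and keeps exact counts in a dictionary. The function names (`build_kmer_counts`, `junction_kmers`) and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_kmer_counts(reads, k):
    """Count k-mer occurrences over all reads (exact, in memory)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def successors(kmer, nodes):
    """k-mers in the graph that overlap this k-mer's (k-1)-suffix."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]


def predecessors(kmer, nodes):
    """k-mers in the graph that overlap this k-mer's (k-1)-prefix."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in nodes]


def junction_kmers(counts):
    """Return {k-mer: coverage} for k-mers where the graph branches.

    Assumption: a junction is any node with more than one successor
    or more than one predecessor.
    """
    nodes = set(counts)
    return {
        km: counts[km]
        for km in nodes
        if len(successors(km, nodes)) > 1 or len(predecessors(km, nodes)) > 1
    }


# Two toy reads that share the 3-mer CGT and then diverge.
reads = ["ACGTAC", "CGTTGA"]
counts = build_kmer_counts(reads, k=3)
junctions = junction_kmers(counts)
print(junctions)
```

In this toy input the only branching node is `CGT`, which is followed by `GTA` in one read and `GTT` in the other, so it is the only k-mer for which a coverage count would be retained.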

Crosslink: A fast, scriptable genetic mapper for outcrossing species

10.1101/135277 ◽

2017 ◽

Cited By ~ 6

Author(s):

Robert J. Vickerstaff ◽

Richard J. Harrison

Keyword(s):

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Mapping Software ◽

Outcrossing Species ◽

Supplementary Material ◽

Novel Approaches ◽

Similar Accuracy ◽

General Public License

Abstract

Summary: Crosslink is genetic mapping software for outcrossing species, designed to run efficiently on large datasets by combining the best of existing tools with novel approaches. Tests show that it runs much faster than several comparable programs while retaining similar accuracy.

Availability and implementation: Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/[email protected]

Supplementary information: Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.

PhyloFold: Precise and Swift Prediction of RNA Secondary Structures to Incorporate Phylogeny among Homologs

10.1101/2020.03.05.975797 ◽

2020 ◽

Author(s):

Masaki Tagashira

Keyword(s):

Secondary Structure ◽

Rna Secondary Structure ◽

Prediction Accuracy ◽

Structural Alignment ◽

Source Code ◽

Secondary Structures ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Structural Alignments

Abstract

Motivation: The simultaneous consideration of sequence alignment and RNA secondary structure, known as structural alignment, helps predict more accurate secondary structures of homologs. However, this consideration is computationally heavy and can be done only roughly to decompose structural alignments.

Results: The PhyloFold method, which predicts secondary structures of homologs by considering likely pairwise structural alignments, was developed in this study. The method achieves the best prediction accuracy while requiring running time comparable to that of conventional methods.

Availability: The source code of the programs implemented in this study is available at https://github.com/heartsh/phylofold and https://github.com/heartsh/phyloalifold.

Contact: [email protected]

Supplementary information: Supplementary data are available.

GTShark: Genotype compression in large projects

10.1101/494104 ◽

2018 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Danek

Keyword(s):

Web Site ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Large Project ◽

Supplementary Material

Abstract

Summary: Nowadays, large sequencing projects handle tens of thousands of individuals. The huge files summarizing the findings definitely require compression. We propose a tool able to compress large collections of genotypes, as well as single samples in such projects, to sizes not achievable to date.

Availability and implementation: https://github.com/refresh-bio/[email protected]

Supplementary information: Supplementary data are available at the publisher’s Web site.

Haplotype-aware graph indexes

10.1101/559583 ◽

2019 ◽

Cited By ~ 7

Author(s):

Jouni Sirén ◽

Erik Garrison ◽

Adam M. Novak ◽

Benedict Paten ◽

Richard Durbin

Keyword(s):

Genetic Variation ◽

Chromosome 17 ◽

Supplementary Information ◽

Whole Genome ◽

Supplementary Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Link Type ◽

Supplementary Material ◽

Haplotype Information

Abstract

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are nonbiological, unlikely recombinations of true haplotypes.

Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Availability: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt, and https://github.com/jltsiren/[email protected]

Supplementary information: Supplementary data are available.

dms2dfe: Comprehensive Workflow for Analysis of Deep Mutational Scanning Data

10.1101/072645 ◽

2016 ◽

Cited By ~ 2

Author(s):

Rohan Dandage ◽

Kausik Chakraborty

Keyword(s):

Noise Reduction ◽

High Throughput ◽

Critical Issue ◽

Supplementary Information ◽

Supplementary Data ◽

Selection Pressures ◽

Link Type ◽

Supplementary Material ◽

End To End ◽

Python Package

Summary: High-throughput genotype-to-phenotype (G2P) data are increasingly being generated by the widely applicable Deep Mutational Scanning (DMS) method. dms2dfe is a comprehensive end-to-end workflow that addresses the critical issue of noise reduction and offers a variety of crucial downstream analyses. Noise reduction is carried out by normalizing mutant counts by sequencing depth, followed by dispersion shrinkage at the stage where preferential enrichments are calculated. In downstream analyses, the dms2dfe workflow identifies relative selection pressures and potential molecular constraints, and generates data-rich visualizations.

Availability: dms2dfe is implemented as a Python package and is available at https://kc-lab.github.io/[email protected], [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
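The depth normalization step can be illustrated with a short sketch. This is not dms2dfe's code or API: the function names, the pseudocount, and the mutant labels are assumptions made for illustration, and the dispersion shrinkage step mentioned above is deliberately left out.

```python
import math


def depth_normalize(counts, pseudocount=0.5):
    """Convert raw mutant counts to frequencies relative to sequencing depth.

    The pseudocount (an assumption of this sketch) keeps mutants with
    zero reads from producing log2(0) later on.
    """
    depth = sum(counts.values()) + pseudocount * len(counts)
    return {mutant: (c + pseudocount) / depth for mutant, c in counts.items()}


def log2_enrichment(unselected, selected, pseudocount=0.5):
    """Log2 ratio of depth-normalized frequencies, selected vs. unselected."""
    # Score every mutant seen in the unselected pool; absent ones count as 0.
    selected = {m: selected.get(m, 0) for m in unselected}
    f_unsel = depth_normalize(unselected, pseudocount)
    f_sel = depth_normalize(selected, pseudocount)
    return {m: math.log2(f_sel[m] / f_unsel[m]) for m in unselected}


# Hypothetical counts: the selected pool was sequenced twice as deeply.
unselected = {"A10G": 100, "L25P": 100}
selected = {"A10G": 300, "L25P": 100}

scores = log2_enrichment(unselected, selected)
print(scores)
```

Although `L25P` has the same raw count in both pools, its share of the reads falls after selection, so its enrichment is negative once counts are expressed relative to depth.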

LIONS: Analysis Suite for Detecting and Quantifying Transposable Element Initiated Transcription from RNA-seq

10.1101/149864 ◽

2017 ◽

Cited By ~ 2

Author(s):

Artem Babaian ◽

Richard Thompson ◽

Jake Lever ◽

Liane Gagnier ◽

Mohammad M. Karimi ◽

...

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Test Data ◽

Source Code ◽

Supplementary Information ◽

Transcriptional Networks ◽

Supplementary Data ◽

Rna Seq ◽

Instruction Manual ◽

Supplementary Material

Abstract

Summary: Transposable elements (TEs) influence the evolution of novel transcriptional networks, yet the specific and meaningful interpretation of how TE-initiation events contribute to the transcriptome has been marred by computational and methodological deficiencies. We developed LIONS for the analysis of paired-end RNA-seq data to specifically detect and quantify TE-initiated transcripts.

Availability: Source code, container, test data and instruction manual are freely available at www.github.com/ababaian/[email protected] or [email protected] or [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

MBG: Minimizer-based Sparse de Bruijn Graph Construction

10.1101/2020.09.18.303156 ◽

2020 ◽

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Source Code ◽

Error Rates ◽

Read Length ◽

De Bruijn Graph ◽

De Bruijn Graphs ◽

E Coli ◽

Link Type ◽

Sequencing Technologies ◽

Long Read ◽

De Bruijn

Motivation: De Bruijn graphs can be constructed from short reads efficiently and have been used for many purposes. Traditionally, long-read sequencing technologies have had error rates too high for de Bruijn graph-based methods. Recently, HiFi reads have provided a combination of long read length and low error rate, which enables de Bruijn graphs to be used with HiFi reads.

Results: We have implemented MBG, a tool for building sparse de Bruijn graphs from HiFi reads. MBG outperforms existing tools for building dense de Bruijn graphs, and can build a graph of 50x coverage whole human genome HiFi reads in four hours on a single core. MBG also assembles the bacterial E. coli genome into a single contig in 8 seconds.

Availability: Package manager: https://anaconda.org/bioconda/mbg and source code: https://github.com/maickrau/MBG
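The general idea of minimizer selection, which makes the graph sparse by keeping only a subset of k-mers, can be sketched as follows. This is a textbook window-minimizer scheme under stated assumptions, not MBG's implementation: it orders k-mers lexicographically rather than by hash, ignores reverse complements, and the function name `minimizer_positions` is hypothetical.

```python
def minimizer_positions(seq, k, w):
    """Return start positions of the minimizer k-mers of seq.

    Each window of w consecutive k-mers contributes its smallest k-mer
    (lexicographic order, leftmost on ties). Consecutive windows often
    pick the same k-mer, so far fewer than all k-mers are kept.
    """
    all_kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    positions = []
    for start in range(len(all_kmers) - w + 1):
        window = all_kmers[start:start + w]
        # Index of the smallest k-mer in this window, leftmost on ties.
        offset = min(range(w), key=lambda j: (window[j], j))
        pos = start + offset
        # Selected positions never move backwards, so comparing with the
        # last kept position is enough to skip repeats.
        if not positions or positions[-1] != pos:
            positions.append(pos)
    return positions


seq = "ACGTACGT"
picked = minimizer_positions(seq, k=3, w=2)
print(picked, [seq[p:p + 3] for p in picked])
```

In a sparse graph built this way, only the selected k-mers would serve as nodes, with edges linking minimizers that are adjacent along a read; larger windows select fewer k-mers.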

hts-nim: scripting high-performance genomic analyses

10.1101/261735 ◽

2018 ◽

Author(s):

Brent S. Pedersen ◽

Aaron R. Quinlan

Keyword(s):

High Performance ◽

Genomic Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scripting Languages ◽

Link Type ◽

Custom Software ◽

Genomic Analyses ◽

Biological Insight ◽

Supplementary Material

Abstract

Motivation: Extracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute at the scale of modern genomics datasets.

Results: We present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance.

Availability: hts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools both under the MIT [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

mirtronDB: a mirtron knowledge base

10.1101/429522 ◽

2018 ◽

Author(s):

Bruno Henrique Ribeiro Da Fonseca ◽

Douglas Silva Domingues ◽

Alexandre Rossi Paschoal

Keyword(s):

Knowledge Base ◽

Supplementary Information ◽

Supplementary Data ◽

Knowledge Database ◽

Group Type ◽

Link Type ◽

Supplementary Material ◽

Access To Knowledge ◽

User Friendly ◽

Organism Group

Abstract

Motivation: Mirtrons originate from short introns through cleavage that is atypical of the canonical miRNA pathway, using the splicing mechanism. Several studies describe mirtrons in chordates, invertebrates and plants, but the current literature has no repository that centralizes and organizes these publicly available data. To fill this gap, we created the first knowledge database dedicated to mirtrons, called mirtronDB, available at http://mirtrondb.cp.utfpr.edu.br/. MirtronDB has a total of 1,407 mirtron precursors and 2,426 mirtron mature sequences in 18 species.

Results: Through a user-friendly interface, users can browse and search mirtrons by organism, organism group, type and name. MirtronDB is a specialized resource for exploring mirtrons and their regulation, providing free, user-friendly access to knowledge on mirtron data.

Availability: MirtronDB is available at http://mirtrondb.cp.utfpr.edu.br/[email protected]

Supplementary information: Supplementary data are available.
