Haplotype-aware graph indexes

Bioinformatics ◽ 10.1093/bioinformatics/btz575 ◽ 2019 ◽ Cited By ~ 6

Author(s): Jouni Sirén ◽ Erik Garrison ◽ Adam M Novak ◽ Benedict Paten ◽ Richard Durbin

Keyword(s): Genetic Variation ◽ Precision Medicine ◽ Chromosome 17 ◽ Supplementary Information ◽ Whole Genome ◽ Supplementary Data ◽ 1000 Genomes Project ◽ 1000 Genomes ◽ Burrows Wheeler Transform ◽ Haplotype Information

Abstract. Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability and implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. Supplementary information: Supplementary data are available at Bioinformatics online.
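The motivation stated in this abstract (most graph paths are recombinations that no observed haplotype follows) can be illustrated with a small sketch. The toy graph, the two "observed" haplotypes, and the names `SEGMENTS`, `spell` and `kmers` are all hypothetical; the code does not use VG, GBWT or GCSA2 and is not the authors' algorithm. It only enumerates paths and compares k-mer sets.

```python
from itertools import product

# Hypothetical toy graph: shared segments (str) alternate with biallelic
# sites (tuple of two alleles). None of this comes from the paper's data.
SEGMENTS = ["GAT", ("A", "C"), "TT", ("G", "T"), "CA", ("A", "G"), "C"]
N_SITES = sum(isinstance(s, tuple) for s in SEGMENTS)


def spell(choices):
    """Spell the sequence of the path that takes allele choices[i] at site i."""
    picks = iter(choices)
    return "".join(s if isinstance(s, str) else s[next(picks)] for s in SEGMENTS)


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 4
# Every combination of alleles is a valid path through the graph.
all_paths = [spell(c) for c in product((0, 1), repeat=N_SITES)]
# Assumption: only these two paths were actually observed as haplotypes.
haplotypes = [spell((0, 0, 0)), spell((1, 1, 1))]

graph_kmers = set().union(*(kmers(p, K) for p in all_paths))
haplotype_kmers = set().union(*(kmers(h, K) for h in haplotypes))
recombinant_only = graph_kmers - haplotype_kmers

print(len(all_paths), "paths,", len(haplotypes), "haplotypes")
print(len(recombinant_only), "k-mers occur only on recombinant paths")
```

In this toy setting, every haplotype k-mer is also a graph k-mer, while some graph k-mers appear only on recombinant paths. A haplotype-aware index is meant to tell these two groups apart.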


Crosslink: A fast, scriptable genetic mapper for outcrossing species

10.1101/135277 ◽ 2017 ◽ Cited By ~ 6

Author(s): Robert J. Vickerstaff ◽ Richard J. Harrison

Keyword(s): Large Datasets ◽ Supplementary Information ◽ Supplementary Data ◽ Link Type ◽ Mapping Software ◽ Outcrossing Species ◽ Supplementary Material ◽ Novel Approaches ◽ Similar Accuracy ◽ General Public License

Abstract. Summary: Crosslink is genetic mapping software for outcrossing species designed to run efficiently on large datasets by combining the best from existing tools with novel approaches. Tests show it runs much faster than several comparable programs whilst retaining a similar accuracy. Availability and implementation: Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.


GTShark: Genotype compression in large project

10.1101/494104 ◽ 2018

Author(s): Sebastian Deorowicz ◽ Agnieszka Danek

Keyword(s): Web Site ◽ Supplementary Information ◽ Supplementary Data ◽ Link Type ◽ Large Project ◽ Supplementary Material

Abstract. Summary: Nowadays large sequencing projects handle tens of thousands of individuals. The huge files summarizing the findings definitely require compression. We propose a tool able to compress large collections of genotypes as well as single samples in such projects to sizes not achievable to date. Availability and implementation: https://github.com/refresh-bio/[email protected] Supplementary information: Supplementary data are available at publisher’s Web site.


Varstation: a complete and efficient tool to support NGS data analysis

10.1101/833582 ◽ 2019

Author(s): ACO Faria ◽ MP Caraciolo ◽ RM Minillo ◽ TF Almeida ◽ SM Pereira ◽ ...

Keyword(s): Genetic Variation ◽ Data Analysis ◽ Supplementary Information ◽ Human Genetic Variation ◽ Supplementary Data ◽ Efficient Tool ◽ Link Type ◽ Data Processor ◽ Ngs Data Analysis ◽ Ngs Data

Abstract. Summary: Varstation is a cloud-based NGS data processor and analyzer for human genetic variation. This resource provides a customizable, centralized, safe and clinically validated environment aiming to improve and optimize the flow of NGS analyses and reports related with clinical and research genetics. Availability and implementation: Varstation is freely available at http://varstation.com, for academic [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


dms2dfe: Comprehensive Workflow for Analysis of Deep Mutational Scanning Data

10.1101/072645 ◽ 2016 ◽ Cited By ~ 2

Author(s): Rohan Dandage ◽ Kausik Chakraborty

Keyword(s): Noise Reduction ◽ High Throughput ◽ Critical Issue ◽ Supplementary Information ◽ Supplementary Data ◽ Selection Pressures ◽ Link Type ◽ Supplementary Material ◽ End To End ◽ Python Package

Summary: High-throughput genotype-to-phenotype (G2P) data is increasingly being generated by the widely applicable Deep Mutational Scanning (DMS) method. dms2dfe is a comprehensive end-to-end workflow that addresses the critical issue of noise reduction and offers a variety of crucial downstream analyses. Noise reduction is carried out by normalizing counts of mutants by depth of sequencing and subsequent dispersion shrinkage at the level of calculation of preferential enrichments. In downstream analyses, the dms2dfe workflow provides identification of relative selection pressures, potential molecular constraints and generation of data-rich visualizations. Availability: dms2dfe is implemented as a Python package and is available at https://kc-lab.github.io/[email protected], [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
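The first noise-reduction step named in this summary, normalizing mutant counts by sequencing depth, can be sketched as follows. This is a minimal illustration under stated assumptions and not the dms2dfe API. The function names, the mutant labels, the use of a log2 ratio as the "preferential enrichment", and the pseudocount are all hypothetical choices, and the dispersion shrinkage step is not shown.

```python
import math
from typing import Dict


def depth_normalize(counts: Dict[str, int], depth: int) -> Dict[str, float]:
    """Convert raw mutant read counts to frequencies by dividing by depth."""
    if depth <= 0:
        raise ValueError("sequencing depth must be positive")
    return {mutant: n / depth for mutant, n in counts.items()}


def log2_enrichment(before: Dict[str, float], after: Dict[str, float],
                    pseudo: float = 1e-9) -> Dict[str, float]:
    """Log2 ratio of post-selection to pre-selection frequency per mutant.

    Assumption: a small pseudocount avoids division by zero; the real
    workflow may handle zero counts differently.
    """
    return {
        m: math.log2((after.get(m, 0.0) + pseudo) / (before[m] + pseudo))
        for m in before
    }


# Hypothetical counts for two mutants before and after selection.
input_freq = depth_normalize({"A1G": 100, "A1T": 100}, depth=1000)
selected_freq = depth_normalize({"A1G": 400, "A1T": 50}, depth=2000)
enrichment = log2_enrichment(input_freq, selected_freq)
```

Dividing by depth first means that a library sequenced twice as deeply does not appear twice as enriched. Only the change in relative frequency contributes to the ratio.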


Accurate, scalable cohort variant calls using DeepVariant and GLnexus

Bioinformatics ◽ 10.1093/bioinformatics/btaa1081 ◽ 2021

Author(s): Taedong Yun ◽ Helen Li ◽ Pi-Chuan Chang ◽ Michael F Lin ◽ Andrew Carroll ◽ ...

Keyword(s): Best Practices ◽ Quality Metrics ◽ Supplementary Information ◽ Public Research ◽ Supplementary Data ◽ Quality Improvements ◽ 1000 Genomes Project ◽ Individual Level ◽ 1000 Genomes ◽ Population Scale

Abstract. Motivation: Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. Results: We introduce an open-source cohort-calling method that uses the highly-accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently-generated GATK Best Practices pipeline. Availability and implementation: We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-sourced, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later. Supplementary information: Supplementary data are available at Bioinformatics online.
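One of the quality metrics mentioned in this abstract, Mendelian consistency in father-mother-child trios, can be sketched in a few lines. This is an illustrative check only and not the paper's evaluation code. It assumes autosomal diploid genotypes with no missing calls, encodes alleles as integers, and treats every de novo mutation as an inconsistency. The names `Genotype`, `is_mendelian_consistent` and `consistency_rate` are hypothetical.

```python
from typing import Iterable, Tuple

Genotype = Tuple[int, int]  # diploid call as two allele indices, e.g. (0, 1)


def is_mendelian_consistent(father: Genotype, mother: Genotype,
                            child: Genotype) -> bool:
    """True if the child could inherit one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def consistency_rate(trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (father, mother, child) calls that are consistent."""
    results = [is_mendelian_consistent(f, m, c) for f, m, c in trios]
    if not results:
        raise ValueError("no trio calls given")
    return sum(results) / len(results)


# Hypothetical calls at three sites for one trio.
calls = [
    ((0, 0), (0, 1), (0, 1)),  # consistent: 0 from father, 1 from mother
    ((0, 0), (0, 1), (1, 1)),  # inconsistent: father has no allele 1
    ((0, 1), (0, 1), (1, 1)),  # consistent: 1 from each parent
]
rate = consistency_rate(calls)
```

A higher rate across many sites suggests fewer genotyping errors, which is why the metric is useful when no truth set exists for the samples.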


hts-nim: scripting high-performance genomic analyses

10.1101/261735 ◽ 2018

Author(s): Brent S. Pedersen ◽ Aaron R. Quinlan

Keyword(s): High Performance ◽ Genomic Data ◽ Supplementary Information ◽ Supplementary Data ◽ Scripting Languages ◽ Link Type ◽ Custom Software ◽ Genomic Analyses ◽ Biological Insight ◽ Supplementary Material

Abstract. Motivation: Extracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute with the scale of modern genomics datasets. Results: We present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance. Availability: hts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools both under the MIT [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


mirtronDB: a mirtron knowledge base

10.1101/429522 ◽ 2018

Author(s): Bruno Henrique Ribeiro Da Fonseca ◽ Douglas Silva Domingues ◽ Alexandre Rossi Paschoal

Keyword(s): Knowledge Base ◽ Supplementary Information ◽ Supplementary Data ◽ Knowledge Database ◽ Group Type ◽ Link Type ◽ Supplementary Material ◽ Access To Knowledge ◽ User Friendly ◽ Organism Group

Abstract. Motivation: Mirtrons originate from short introns whose cleavage departs from the canonical miRNA pathway by using the splicing mechanism. Several studies describe mirtrons in chordates, invertebrates and plants, but in the current literature there is no repository that centralizes and organizes these publicly available data. To fill this gap, we created the first knowledge database dedicated to mirtrons, called mirtronDB, available at http://mirtrondb.cp.utfpr.edu.br/. MirtronDB has a total of 1,407 mirtron precursors and 2,426 mirtron mature sequences in 18 species. Results: Through a user-friendly interface, users can browse and search mirtrons by organism, organism group, type and name. MirtronDB is a specialized resource to explore mirtrons and their regulations, providing free, user-friendly access to knowledge on mirtron data. Availability: MirtronDB is available at http://mirtrondb.cp.utfpr.edu.br/[email protected] Supplementary information: Supplementary data are available.


deSPI: efficient classification of metagenomic reads with lightweight de Bruijn graph-based reference indexing

10.1101/080200 ◽ 2016 ◽ Cited By ~ 1

Author(s): Dengfeng Guan ◽ Bo Liu ◽ Yadong Wang

Keyword(s): Source Code ◽ Classification Method ◽ Supplementary Information ◽ De Bruijn Graph ◽ Supplementary Data ◽ Link Type ◽ Memory Footprint ◽ Supplementary Material ◽ De Bruijn

Abstract. Summary: In metagenomic studies, fast and effective tools are in wide demand to implement taxonomy classification for up to billions of reads. Herein, we propose deSPI, a novel read classification method that classifies reads by recognizing and analyzing the matches between reads and reference with de Bruijn graph-based lightweight reference indexing. deSPI is faster and has a relatively small memory footprint, while achieving higher or similar sensitivity and accuracy. Availability: The C++ source code of deSPI is available at https://github.com/hitbc/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Tracking cytosine depletion in SARS-CoV-2

10.1101/2020.10.26.354787 ◽ 2020

Author(s): Ruibang Luo ◽ Yat-Sing Wong ◽ Tak-Wah Lam

Keyword(s): Supplementary Information ◽ Whole Genome ◽ Supplementary Data ◽ Composition Change ◽ Link Type ◽ Over Time

Abstract. Motivation: Danchin et al. have pointed out that cytosine drives the evolution of SARS-CoV-2. A depletion of cytosine might lead to the attenuation of SARS-CoV-2. Results: We built a website to track the composition change of mono-, di-, and tri-nucleotide of SARS-CoV-2 over time. The website downloads new strains available from GISAID and updates its results daily. Our analysis suggests that the composition of cytosine in coronaviruses is related to their reported mortality. Using 137,315 SARS-CoV-2 strains collected in ten months, we observed cytosine depletion at a rate of about one cytosine loss per month from the whole genome. Availability: The website is available at http://www.bio8.cs.hku.hk/sarscov2/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
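The mono-, di- and tri-nucleotide composition tracked by the website can be computed as a k-mer frequency table with k = 1, 2 or 3. The sketch below is an assumption about how such a table might be built, not the site's actual code. The function name `kmer_composition` and the toy sequence are hypothetical, and windows containing ambiguous bases such as N are skipped by choice.

```python
from collections import Counter
from typing import Dict

VALID_BASES = set("ACGT")


def kmer_composition(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of each k-mer, ignoring windows with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy sequence; a real analysis would read genome FASTA records.
toy_genome = "ATGCCGTACCNATC"
mono = kmer_composition(toy_genome, 1)  # mononucleotide composition
di = kmer_composition(toy_genome, 2)    # dinucleotide composition
tri = kmer_composition(toy_genome, 3)   # trinucleotide composition
cytosine_fraction = mono.get("C", 0.0)
```

Computing `cytosine_fraction` for genomes grouped by collection date would give the kind of time series from which a depletion rate could be estimated.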
