KAGE: Fast alignment-free graph-based genotyping of SNPs and short indels

Abstract One of the core applications of high-throughput sequencing is the characterization of individual genetic variation. Traditionally, variants have been inferred by comparing sequenced reads to a reference genome. There has recently been an emergence of genotyping methods, which instead infer variants of an individual based on variation present in population-scale repositories like the 1000 Genomes Project. However, commonly used methods for genotyping are slow since they still require mapping of reads to a reference genome. Also, since traditional reference genomes do not include genetic variation, traditional genotypers suffer from reference bias and poor accuracy in variation-rich regions where reads cannot accurately be mapped. We here present KAGE, a genotyper for SNPs and short indels that is inspired by recent developments within graph-based genome representations and alignment-free genotyping. We propose two novel ideas to improve both the speed and accuracy: we (1) use known genotypes from thousands of individuals in a Bayesian model to predict genotypes, and (2) propose a computationally efficient method for leveraging correlation between variants. We show on experimental data that KAGE is both faster and more accurate than other alignment-free genotypers. KAGE is able to genotype a new sample (15x coverage) in less than half an hour on a consumer laptop, more than 10 times faster than the fastest existing methods, making it ideal in clinical settings or when large numbers of individuals are to be genotyped at low computational cost.
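
The abstract above mentions using known genotypes from many individuals in a Bayesian model. As a rough illustration of that general idea only (not KAGE's actual model), the sketch below combines a prior taken from population genotype counts with a binomial likelihood over allele-supporting observations. The function name `genotype_posterior`, the `error_rate` parameter, the add-one smoothing and the example counts are all assumptions made for this illustration.

```python
from math import comb


def genotype_posterior(ref_count, alt_count, prior_counts, error_rate=0.01):
    """Posterior over genotypes 0/0, 0/1, 1/1 for one biallelic variant.

    ref_count / alt_count: observations supporting each allele (hypothetical
    inputs, e.g. k-mer or read counts).
    prior_counts: how many known individuals carry each genotype
    (hypothetical population data).
    """
    n = ref_count + alt_count
    # Assumed probability that a single observation supports the alt allele
    # under each genotype; error_rate models spurious observations.
    alt_fraction = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}

    # Prior from population genotype counts, with add-one smoothing so that
    # an unseen genotype never gets probability zero.
    smoothed = {g: prior_counts.get(g, 0) + 1 for g in alt_fraction}
    total = sum(smoothed.values())

    unnormalised = {}
    for genotype, p in alt_fraction.items():
        prior = smoothed[genotype] / total
        likelihood = comb(n, alt_count) * p**alt_count * (1.0 - p) ** ref_count
        unnormalised[genotype] = prior * likelihood

    evidence = sum(unnormalised.values())
    return {g: v / evidence for g, v in unnormalised.items()}


if __name__ == "__main__":
    # Made-up numbers purely for demonstration.
    population = {"0/0": 1500, "0/1": 800, "1/1": 200}
    posterior = genotype_posterior(7, 8, population)
    print(max(posterior, key=posterior.get))
```

With no observations at all, the posterior in this sketch simply falls back to the smoothed population prior, which is the intuition behind letting population genotypes inform low-coverage calls.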


Reference flow: reducing reference bias using multiple population genomes

Genome Biology ◽

10.1186/s13059-020-02229-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Nae-Chyun Chen ◽

Brad Solomon ◽

Taher Mun ◽

Sheila Iyer ◽

Ben Langmead

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Alignment Method ◽

Sequencing Data ◽

Computational Overhead ◽

Reference Flow ◽

Multiple Population ◽

Reference Bias ◽

Flow Alignment ◽

Reference Genomes

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract Motivation Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Contact [email protected]; [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Reducing reference bias using multiple population reference genomes

10.1101/2020.03.03.975219 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nae-Chyun Chen ◽

Brad Solomon ◽

Taher Mun ◽

Sheila Iyer ◽

Ben Langmead

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Alignment Method ◽

Sequencing Data ◽

Computational Overhead ◽

Reference Flow ◽

Multiple Population ◽

Reference Bias ◽

Flow Alignment ◽

Reference Genomes

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the “reference flow” alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance, but with 14% of the memory footprint and 5.5 times the speed.


CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract Motivation Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability and implementation CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information Supplementary data are available at Bioinformatics online.
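
To make the memory concern in the abstract above concrete, here is a minimal sketch of alignment-free comparison with k-mer frequency vectors. It is not CRAFT's embedding method; it shows only the baseline idea the abstract starts from. The helper names (`kmer_frequencies`, `cosine_similarity`, `dense_vector_size`) and the choice of cosine similarity are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k):
    """Return the normalised k-mer frequency vector of seq as a sparse dict."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def cosine_similarity(f, g):
    """Cosine similarity between two sparse k-mer frequency vectors."""
    dot = sum(v * g.get(kmer, 0.0) for kmer, v in f.items())
    norm_f = math.sqrt(sum(v * v for v in f.values()))
    norm_g = math.sqrt(sum(v * v for v in g.values()))
    return dot / (norm_f * norm_g)


def dense_vector_size(k, alphabet_size=4):
    """Number of entries in a dense k-mer vector over a DNA alphabet."""
    return alphabet_size ** k


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTACGT", 3)
    b = kmer_frequencies("ACGTTCGTACGA", 3)
    print(cosine_similarity(a, b))
    # A dense vector has 4**k entries, so it grows quickly with k.
    print(dense_vector_size(12))
```

The dense vector length grows as 4^k, which is why storing full frequency vectors for large k becomes costly and why a compact representation is attractive.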


Systematic benchmark of ancient DNA read mapping

Briefings in Bioinformatics ◽

10.1093/bib/bbab076 ◽

2021 ◽

Author(s):

Adrien Oliva ◽

Raymond Tobler ◽

Alan Cooper ◽

Bastien Llamas ◽

Yassine Souilmi

Keyword(s):

Ancient Dna ◽

Dna Sequences ◽

Population Genetic ◽

Reference Genome ◽

Population Data ◽

Human Populations ◽

Current Standard ◽

Read Mapping ◽

Reference Bias ◽

The Impact

Abstract The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA ‘reads’) against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have effectively remained unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four different read mapping software—BWA-aln, BWA-mem, NovoAlign and Bowtie2—and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.


Compressed pseudo-SLAM: pseudorange-integrated compressed simultaneous localisation and mapping for unmanned aerial vehicle navigation

Journal of Navigation ◽

10.1017/s037346332100031x ◽

2021 ◽

pp. 1-13

Author(s):

Jonghyuk Kim ◽

Jose Guivant ◽

Martin L. Sollie ◽

Torleiv H. Bryne ◽

Tor Arne Johansen

Keyword(s):

Unmanned Aerial Vehicle ◽

Computational Cost ◽

Satellite System ◽

Measurement Unit ◽

Electronic Systems ◽

Computationally Efficient ◽

Global Correlation ◽

Aerial Vehicle ◽

Correlation Information ◽

Global Navigation Satellite

Abstract This paper addresses the fusion of the pseudorange/pseudorange rate observations from the global navigation satellite system and the inertial–visual simultaneous localisation and mapping (SLAM) to achieve reliable navigation of unmanned aerial vehicles. This work extends the previous work on a simulation-based study [Kim et al. (2017). Compressed fusion of GNSS and inertial navigation with simultaneous localisation and mapping. IEEE Aerospace and Electronic Systems Magazine, 32(8), 22–36] to a real-flight dataset collected from a fixed-wing unmanned aerial vehicle platform. The dataset consists of measurements from visual landmarks, an inertial measurement unit, and pseudorange and pseudorange rates. We propose a novel all-source navigation filter, termed a compressed pseudo-SLAM, which can seamlessly integrate all available information in a computationally efficient way. In this framework, a local map is dynamically defined around the vehicle, updating the vehicle and local landmark states within the region. A global map includes the rest of the landmarks and is updated at a much lower rate by accumulating (or compressing) the local-to-global correlation information within the filter. We show that the horizontal navigation error is effectively constrained with one satellite vehicle and one landmark observation. The computational cost is analysed, demonstrating the efficiency of the method.


A novel nonsense variant of the AGXT identified in a Chinese family: special variant research in the Chinese reference genome

BMC Nephrology ◽

10.1186/s12882-021-02276-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Chang Bao Xu ◽

Xu Dong Zhou ◽

Hong En Xu ◽

Yong Li Zhao ◽

Xing Hua Zhao ◽

...

Keyword(s):

High Throughput Sequencing ◽

Reference Genome ◽

Genetic Diagnosis ◽

Primary Hyperoxaluria ◽

Missense Variant ◽

Population Heterogeneity ◽

Chinese Family ◽

Pathogenic Variants ◽

Variation Database ◽

Genetic Level

Abstract Background Primary hyperoxaluria (PH) is a rare autosomal recessive genetic disease that comprises three subtypes (PH1, PH2 and PH3). Approximately 80% of PH patients have been reported as subtype PH1, which has been related to a higher risk of renal failure at any age. Several genetic studies indicate that variants in the gene AGXT are responsible for the occurrence of PH1. However, the population heterogeneity of the variants in AGXT makes the genetic diagnosis of PH1 more challenging, as it is hard to locate each specific variant. It is valuable to have a complete spectrum of AGXT variants from different populations for early diagnosis and clinical treatment of PH1. Case presentation In this study, we performed high-throughput sequencing and genetic analysis of a 6-year-old male PH1 patient from a Chinese family. Two variants (c.346G > A: p.Gly116Arg; c.864G > A: p.Trp288X) of the gene AGXT were identified. We found a nonsense variant (c.864G > A: p.Trp288X) that comes from the proband’s mother and has never been reported previously. The other, a missense variant (c.346G > A: p.Gly116Arg), was inherited from his father and has been found previously in a domain of aminotransferase, which plays an important role in the function of AGT protein. Furthermore, we searched for 110 pathogenic variants of AGXT that have been reported worldwide in a healthy local Chinese population; none of these pathogenic variants was detected in the local genomes. Conclusions Our research provides an important diagnostic basis for PH1 on the genetic level by updating the genotype of PH1 and also develops a better understanding of the variants in AGXT by broadening the variation database of AGXT according to the Chinese reference genome.


Hybrid LES Approach for Practical Turbomachinery Flows—Part I: Hierarchy and Example Simulations

Journal of Turbomachinery ◽

10.1115/1.4003061 ◽

2011 ◽

Vol 134 (2) ◽

Cited By ~ 16

Author(s):

Paul Tucker ◽

Simon Eastwood ◽

Christian Klostermeier ◽

Richard Jefferson-Loveday ◽

James Tyacke ◽

...

Keyword(s):

Convex Surface ◽

Computational Cost ◽

Companion Paper ◽

Navier Stokes ◽

Computationally Efficient ◽

Turbulent Structures ◽

Hybrid Strategy ◽

Eddy Simulation ◽

Related Approach ◽

Les Model

Unlike Reynolds-averaged Navier–Stokes (RANS) models that need calibration for different flow classes, LES (where larger turbulent structures are resolved by the grid and smaller modeled in a fashion reminiscent of RANS) offers the opportunity to resolve geometry dependent turbulence as found in complex internal flows—albeit at substantially higher computational cost. Based on the results for a broad range of studies involving different numerical schemes, large eddy simulation (LES) models and grid topologies, an LES hierarchy and hybrid LES related approach is proposed. With the latter, away from walls, no LES model is used, giving what can be termed numerical LES (NLES). This is relatively computationally efficient and makes use of the dissipation present in practical industrial computational fluid dynamics (CFD) programs. Near walls, RANS modeling is used to cover over numerous small structures, the LES resolution of which is generally intractable with current computational power. The linking of the RANS and NLES zones through a Hamilton–Jacobi equation is advocated. The RANS-NLES hybridization makes further sense for compressible flow solvers, where, as the Mach number tends to zero at walls, excessive dissipation can occur. The hybrid strategy is used to predict flow over a rib roughened surface and a jet impinging on a convex surface. These cases are important for blade cooling and show encouraging results. Further results are presented in a companion paper.


An Efficient Preconditioner for Linear System Solution in Multi-Domain Modeling of the Circulatory System

Volume 1A: Abdominal Aortic Aneurysms; Active and Reactive Soft Matter; Atherosclerosis; BioFluid Mechanics; Education; Biotransport Phenomena; Bone, Joint and Spine Mechanics; Brain Injury; Cardiac Mechanics; Cardiovascular Devices, Fluids and Imaging; Cartilage and Disc Mechanics; Cell and Tissue Engineering; Cerebral Aneurysms; Computational Biofluid Dynamics; Device Design, Human Dynamics, and Rehabilitation; Drug Delivery and Disease Treatment; Engineered Cellular Environments ◽

10.1115/sbc2013-14392 ◽

2013 ◽

Author(s):

Mahdi Esmaily Moghadam ◽

Yuri Bazilevs ◽

Tain-Yen Hsia ◽

Alison Marsden

Keyword(s):

Strong Coupling ◽

Large Scale ◽

Computational Cost ◽

Circulatory System ◽

Flow Simulation ◽

Global Dynamics ◽

Domain Modeling ◽

Computationally Efficient ◽

Lumped Parameter ◽

System Solution

A closed-loop lumped parameter network (LPN) coupled to a 3D domain is a powerful tool that can be used to model the global dynamics of the circulatory system. Coupling a 0D LPN to a 3D CFD domain is a numerically challenging problem, often associated with instabilities, extra computational cost, and loss of modularity. A computationally efficient finite element framework has been recently proposed that achieves numerical stability without sacrificing modularity [1]. This type of coupling introduces new challenges in the linear algebraic equation solver (LS), producing a strong coupling between flow and pressure that leads to an ill-conditioned tangent matrix. In this paper we exploit this strong coupling to obtain a novel and efficient algorithm for the LS. We illustrate the efficiency of this method on several large-scale cardiovascular blood flow simulation problems.


A Computationally Efficient Predictive Controller for Lane Keeping of Semi-Autonomous Vehicles

Volume 1: Active Control of Aerospace Structure; Motion Control; Aerospace Control; Assistive Robotic Systems; Bio-Inspired Systems; Biomedical/Bioengineering Applications; Building Energy Systems; Condition Based Monitoring; Control Design for Drilling Automation; Control of Ground Vehicles, Manipulators, Mechatronic Systems; Controls for Manufacturing; Distributed Control; Dynamic Modeling for Vehicle Systems; Dynamics and Control of Mobile and Locomotion Robots; Electrochemical Energy Systems ◽

10.1115/dscc2014-6098 ◽

2014 ◽

Cited By ~ 3

Author(s):

Changchun Liu ◽

Chankyu Lee ◽

Andreas Hansen ◽

J. Karl Hedrick ◽

Jieyun Ding

Keyword(s):

Autonomous Vehicles ◽

Piecewise Linear ◽

Computational Cost ◽

Parametric Programming ◽

Linear Feedback ◽

Loop Model ◽

Computationally Efficient ◽

Feedback Function ◽

Predictive Controller ◽

Lane Keeping

Model predictive control (MPC) is a popular technique for the development of active safety systems. However, its high computational cost prevents it from being implemented on lower-cost hardware. This paper presents a computationally efficient predictive controller for lane keeping assistance systems. The controller shares control with the driver and applies corrective steering when there is a potential lane departure. Using explicit feedback MPC, a multi-parametric nonlinear programming problem with a human-in-the-loop model and safety constraints is formulated. The cost function is chosen as the difference between the linear state feedback function to be determined and the resultant optimal control sequence of the MPC problem solved off-line given the current state. The piecewise linear feedback function is obtained by solving the parametric programming problem with an approximation approach. The effectiveness of the controller is evaluated through numerical simulations.
