Context-Aware Seeds for Read Mapping

2019 ◽  
Author(s):  
Hongyi Xin ◽  
Mingfu Shao ◽  
Carl Kingsford

Abstract Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds from each read in order to find all valid mappings under an edit distance threshold of t. As t grows (for example, in long reads with high error rates), this seeding scheme forces mappers to use more and shorter seeds, which increases seed hits (seed frequencies) and therefore reduces mapper efficiency. Results: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases the efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of the confidence radii of the seeds is greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme, in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We evaluate CAS on the E. coli genome and show that it reduces seed frequencies by up to 20.3% compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. Availability: https://github.com/Kingsford-Group/CAS_code
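
To make the stopping condition concrete, here is a minimal Python sketch of the rule stated above: keep extracting non-overlapping seeds until the sum of their confidence radii exceeds the edit-distance threshold t, with unseen seeds falling back to radius 1 as in the classic pigeonhole scheme. The seed length, radius table, and helper names are assumptions for illustration only, not the authors' implementation (which builds a confidence-radius database over the reference).

```python
# Hypothetical sketch of the CAS stopping rule described in the abstract:
# keep extracting non-overlapping seeds until the sum of their confidence
# radii exceeds the edit-distance threshold t.

def select_cas_seeds(read, t, seed_len, confidence_radius):
    """Greedily pick non-overlapping seeds left-to-right.

    confidence_radius: dict mapping a seed string to its radius (>= 1);
    unseen seeds fall back to the pigeonhole value 1.
    """
    seeds, radius_sum, pos = [], 0, 0
    while pos + seed_len <= len(read) and radius_sum <= t:
        seed = read[pos:pos + seed_len]
        r = confidence_radius.get(seed, 1)   # radius 1 recovers the classic scheme
        seeds.append((pos, seed, r))
        radius_sum += r
        pos += seed_len                      # non-overlapping seeds
    return seeds, radius_sum > t             # True: all valid mappings guaranteed

# Example: with radii of 2, three 8-mers already cover a threshold of t = 5.
radii = {"ACGTACGT": 2, "TTGACCAA": 2, "GGCATTGC": 2}
print(select_cas_seeds("ACGTACGTTTGACCAAGGCATTGCAAAA", t=5, seed_len=8,
                       confidence_radius=radii))
```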



2022 ◽  
Author(s):  
Jun Ma ◽  
Manuel Cáceres ◽  
Leena Salmela ◽  
Veli Mäkinen ◽  
Alexandru I. Tomescu

Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improved variant calling. While the vg toolkit (Garrison et al., Nature Biotechnology, 2018) is a popular aligner of short reads, GraphAligner (Rautiainen and Marschall, Genome Biology, 2020) is the state-of-the-art aligner of long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. We present a new algorithm to co-linearly chain a set of seeds in an acyclic variation graph, together with the first efficient implementation of such a co-linear chaining algorithm in a new aligner of long reads to variation graphs, GraphChainer. Compared to GraphAligner, at a normalized edit distance threshold of 40%, it aligns 9% to 12% more reads, and 15% to 19% more total read length, on real PacBio reads from human chromosomes 1 and 22. On both simulated and real data, GraphChainer aligns between 97% and 99% of all reads and of total read length. At the more stringent normalized edit distance threshold of 30%, GraphChainer aligns up to 29% more total real read length than GraphAligner. GraphChainer is freely available at https://github.com/algbio/GraphChainer.
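
To illustrate what co-linear chaining computes, here is a small Python sketch of the classic quadratic dynamic program on a linear reference: given seed matches (anchors), find the highest-coverage chain in which both read and reference coordinates increase. GraphChainer's contribution is doing this efficiently on an acyclic variation graph; the anchor format and scoring below are simplified assumptions, not its actual algorithm or data structures.

```python
# A minimal O(n^2) sketch of co-linear chaining between a read and a *linear*
# reference, illustrating the idea the abstract applies to variation graphs.

def colinear_chain(anchors):
    """anchors: list of (read_pos, ref_pos, length) seed matches.
    Returns the best chain in which both coordinates strictly increase."""
    if not anchors:
        return [], 0
    anchors = sorted(anchors)                      # by read_pos, then ref_pos
    n = len(anchors)
    score = [a[2] for a in anchors]                # best chain ending at i
    prev = [-1] * n
    for i in range(n):
        ri, gi, li = anchors[i]
        for j in range(i):
            rj, gj, lj = anchors[j]
            # anchor j must end before anchor i starts, on both sequences
            if rj + lj <= ri and gj + lj <= gi and score[j] + li > score[i]:
                score[i], prev[i] = score[j] + li, j
    # backtrack from the best-scoring anchor
    best = max(range(n), key=lambda i: score[i])
    chain, best_score = [], score[best]
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return chain[::-1], best_score

print(colinear_chain([(0, 10, 5), (6, 3, 4), (7, 18, 6), (15, 30, 5)]))
```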


2021 ◽  
Author(s):  
Pesho Ivanov ◽  
Benjamin Bichsel ◽  
Martin Vechev

We present a novel A* seed heuristic enabling fast and optimal sequence-to-graph alignment, guaranteed to minimize the edit distance of the alignment assuming non-negative edit costs. We phrase optimal alignment as a shortest path problem and solve it by instantiating the A* algorithm with our novel seed heuristic. The key idea of the seed heuristic is to extract seeds from the read, locate them in the reference, mark preceding reference positions by crumbs, and use the crumbs to direct the A* search. We prove admissibility of the seed heuristic, thus guaranteeing alignment optimality. Our implementation extends the free and open source AStarix aligner and demonstrates that the seed heuristic outperforms all state-of-the-art optimal aligners including GraphAligner, Vargas, PaSGAL, and the prefix heuristic previously employed by AStarix. Specifically, we achieve a consistent speedup of >60x on both short Illumina reads and long HiFi reads (up to 25kbp), on both the E. coli linear reference genome (1Mbp) and the MHC variant graph (5Mbp). Our speedup is enabled by the seed heuristic consistently skipping >99.99% of the table cells that optimal aligners based on dynamic programming compute.
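
The core idea lends itself to a compact sketch. Below is a simplified, hypothetical Python illustration of A* alignment guided by a seed-based admissible heuristic, restricted to a linear reference: any remaining seed of the read that occurs nowhere in the reference must cost at least one edit. It omits graphs, crumbs, trie indexing and all of AStarix's optimizations; the function names and toy inputs are assumptions, not the authors' implementation.

```python
import heapq

def seed_heuristic(read, ref, k):
    """h(i) = number of non-overlapping k-mer seeds of read[i:] that occur
    nowhere in ref; each such seed forces at least one edit, so h is an
    admissible lower bound on the remaining alignment cost."""
    absent = [0] * (len(read) + 1)
    for start in range(len(read) - k, -1, -1):
        miss = 0 if read[start:start + k] in ref else 1
        absent[start] = absent[min(start + k, len(read))] + miss
    return lambda i: absent[i]

def astar_align(read, ref, k=4):
    """Minimal edit distance of the whole read against any substring of ref
    (semi-global alignment), found by A* over the edit graph."""
    h = seed_heuristic(read, ref, k)
    best, pq = {}, []
    for j in range(len(ref) + 1):             # alignment may start anywhere in ref
        best[(0, j)] = 0
        heapq.heappush(pq, (h(0), 0, 0, j))
    while pq:
        f, g, i, j = heapq.heappop(pq)
        if g > best.get((i, j), float("inf")):
            continue                           # stale queue entry
        if i == len(read):
            return g                           # optimal: h admissible, improved states re-pushed
        successors = [(i + 1, j, 1)]           # consume a read base (gap in ref)
        if j < len(ref):
            successors.append((i, j + 1, 1))   # consume a ref base (gap in read)
            successors.append((i + 1, j + 1, 0 if read[i] == ref[j] else 1))
        for ni, nj, cost in successors:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(pq, (ng + h(ni), ng, ni, nj))

print(astar_align("ACGTTGCA", "TTACGTAGCAGG"))   # expected: 1
```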


2021 ◽  
pp. 1-16
Author(s):  
Ibtissem Gasmi ◽  
Mohamed Walid Azizi ◽  
Hassina Seridi-Bouchelaghem ◽  
Nabiha Azizi ◽  
Samir Brahim Belhaouari

A Context-Aware Recommender System (CARS) suggests more relevant services by adapting them to the user's specific context. Nevertheless, using many contextual factors can increase data sparsity, while too few context parameters fail to introduce contextual effects into recommendations. Moreover, several CARSs are based on similarity measures such as cosine similarity and the Pearson correlation coefficient, which are not very effective on sparse datasets. This paper presents a context-aware model that integrates contextual factors into the prediction process when there are insufficient co-rated items. The proposed algorithm uses Latent Dirichlet Allocation (LDA) to learn the latent interests of users from the textual descriptions of items. It then integrates both the explicit contextual factors and their degree of importance into the prediction process through a weighting function, whose weights are learned and optimized with the Particle Swarm Optimization (PSO) algorithm. Results on the MovieLens 1M dataset show that the proposed model achieves an F-measure of 45.51% with a precision of 68.64%. Furthermore, the improvements in MAE and RMSE reach 41.63% and 39.69%, respectively, compared with state-of-the-art techniques.
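
As a concrete, purely hypothetical illustration of the kind of weighting function described above, the sketch below blends an LDA-based topic similarity with a weighted contextual match and tunes the factor weights by a crude random search standing in for PSO; the function names, the blending form and the toy data are assumptions, not the authors' model.

```python
import random

def predict_score(topic_sim, context_match, weights):
    """topic_sim: similarity of LDA topic vectors in [0, 1].
    context_match: dict factor -> 1.0 if that factor matches the target
    context (e.g., same time of day), else 0.0.
    weights: dict factor -> learned importance weight."""
    ctx = sum(weights[f] * context_match[f] for f in weights)
    ctx /= max(sum(weights.values()), 1e-9)
    return 0.5 * topic_sim + 0.5 * ctx

def fit_weights(training_examples, factors, iters=2000):
    """Crude random search over factor weights (a stand-in for PSO),
    minimizing squared error on (topic_sim, context_match, rating) triples."""
    best_w, best_err = None, float("inf")
    for _ in range(iters):
        w = {f: random.random() for f in factors}
        err = sum((predict_score(s, c, w) - r) ** 2
                  for s, c, r in training_examples)
        if err < best_err:
            best_w, best_err = w, err
    return best_w

examples = [(0.9, {"time": 1.0, "companion": 0.0}, 0.8),
            (0.2, {"time": 0.0, "companion": 1.0}, 0.3)]
print(fit_weights(examples, ["time", "companion"]))
```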


2021 ◽  
Vol 25 (2) ◽  
pp. 283-303
Author(s):  
Na Liu ◽  
Fei Xie ◽  
Xindong Wu

Approximate multi-pattern matching, in which patterns contain variable-length wildcards, is a widely and frequently used operation. In this paper, we propose two suffix array-based algorithms to solve this problem. The suffix array is an efficient data structure for exact string matching and has also been applied to approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles patterns whose exact segments are short using dynamic programming, while the second, MMSA-L, deals with long exact segments using an edit distance method. Experimental results on the Pizza & Chili corpus demonstrate that, in most cases, these two newly proposed algorithms are more time-efficient than the state-of-the-art comparison algorithms.
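
For readers unfamiliar with the underlying data structure, the sketch below shows the exact-matching building block both algorithms rely on: a suffix array plus binary search to locate an exact segment of a pattern. The wildcard handling of MMSA-S and the edit-distance layer of MMSA-L are not shown; the naive construction is for clarity only.

```python
def build_suffix_array(text):
    """Naive O(n^2 log n) construction, for clarity only; practical tools
    use O(n log n) or linear-time algorithms."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, segment):
    """All start positions of an exact segment, via binary search on sa."""
    m = len(segment)
    lo, hi = 0, len(sa)
    while lo < hi:                                   # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < segment:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                                   # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= segment:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(first, lo))

text = "abracadabra"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "abra"))            # [0, 7]
```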


2021 ◽  
Vol 15 (6) ◽  
pp. 1-21
Author(s):  
Huandong Wang ◽  
Yong Li ◽  
Mu Du ◽  
Zhenhui Li ◽  
Depeng Jin

Both app developers and service providers have strong motivations to understand when and where certain apps are used. However, this has been a challenging problem due to the highly skewed and noisy nature of app usage data. Moreover, existing studies treat apps as independent items and therefore fail to capture the hidden semantics in app usage traces. In this article, we propose App2Vec, a representation learning model that learns semantic embeddings of apps while taking spatio-temporal context into account. Based on the obtained embeddings, we develop a probabilistic model built on the Bayesian mixture model and the Dirichlet process to capture the when, where, and what semantics of app usage and to predict future usage. We evaluate our model on two different app usage datasets involving over 1.7 million users and 2,000+ apps. Evaluation results show that App2Vec outperforms the state-of-the-art algorithms in app usage prediction with a performance gap of over 17.0%.
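
The general idea of learning app embeddings from usage traces can be illustrated with an off-the-shelf word2vec implementation, treating each usage session as a "sentence"; the toy sessions, the interleaved context tokens and the gensim dependency below are assumptions for illustration, not the authors' App2Vec model or its Bayesian mixture component.

```python
# Generic illustration (not App2Vec): learn app embeddings from sessions.
from gensim.models import Word2Vec

# Toy sessions: apps used in order, interleaved with coarse spatio-temporal
# tokens so that context co-occurs with apps in the training data.
sessions = [
    ["morning", "home", "alarm", "news", "mail"],
    ["evening", "transit", "maps", "music", "messenger"],
    ["evening", "home", "video", "messenger"],
    ["morning", "transit", "maps", "news", "podcast"],
]

model = Word2Vec(sessions, vector_size=16, window=3, min_count=1, sg=1,
                 epochs=200, seed=42)
# Apps that tend to share context end up close in the embedding space.
print(model.wv.most_similar("maps", topn=3))
```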


2020 ◽  
Author(s):  
Andrew J. Page ◽  
Nabil-Fareed Alikhan ◽  
Michael Strinden ◽  
Thanh Le Viet ◽  
Timofey Skvortsov

Abstract Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short-read genome sequencing data; however, no methods exist for long-read sequence data such as that produced by Nanopore or PacBio instruments. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows near real-time spoligotyping from long-read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find that its results are identical to those obtained from short-read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.
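
The principle of spoligotyping from a single long read can be illustrated in a few lines of Python: scan the read (and its reverse complement) for each known spacer and emit a presence/absence pattern. The spacer list below is a made-up toy set rather than the 43 real M. tuberculosis spacers, and exact matching stands in for the error-tolerant matching a tool like Galru needs on noisy uncorrected reads.

```python
# Simplified illustration of spoligotyping from a single long read.

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def spoligotype(read, spacers):
    """Return a binary string: '1' if the spacer is found in the read."""
    hits = []
    for spacer in spacers:
        found = spacer in read or revcomp(spacer) in read
        hits.append("1" if found else "0")
    return "".join(hits)

toy_spacers = ["ACGTACGTAC", "TTGACCAAGT", "GGCATTGCAA", "CCATGGTACC"]
long_read = "NNNACGTACGTACNNNNGGCATTGCAANNN"
print(spoligotype(long_read, toy_spacers))   # '1010' for this toy read
```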


Author(s):  
Luca Costabello ◽  
Fabien Gandon

In this paper the authors focus on context-aware adaptation for Linked Data on mobile devices. They split the problem into two sub-questions: how to declaratively describe context at the RDF presentation level, and how to overcome context imprecision and incompleteness when selecting the proper context description at runtime. The authors answer this two-fold research question with PRISSMA, a context-aware presentation layer for Linked Data. PRISSMA extends the Fresnel vocabulary with the notion of mobile context. In addition, it includes an algorithm that determines whether the sensed context is compatible with some context declaration. The algorithm finds optimal error-tolerant subgraph isomorphisms between RDF graphs using the notion of graph edit distance and is sublinear in the number of context declarations in the system.
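
As a toy illustration of error-tolerant context matching via graph edit distance, the sketch below models the sensed context and each declaration as small labelled graphs and selects the declaration with the smallest edit distance, using networkx's generic GED routine; it is not PRISSMA's subgraph-isomorphism algorithm over RDF, and the graphs, labels and threshold-free selection are invented for the example.

```python
import networkx as nx

def context_graph(triples):
    """Build a small labelled graph from (subject, property, object) triples."""
    g = nx.Graph()
    for s, p, o in triples:
        g.add_node(s, label=s)
        g.add_node(o, label=o)
        g.add_edge(s, o, label=p)
    return g

# Sensed context contains extra information (time) not present in any
# declaration; graph edit distance tolerates this imprecision.
sensed = context_graph([("ctx", "location", "museum"),
                        ("ctx", "device", "phone"),
                        ("ctx", "time", "afternoon")])
declarations = {
    "museum-visit": context_graph([("ctx", "location", "museum"),
                                   ("ctx", "device", "phone")]),
    "commute": context_graph([("ctx", "location", "bus"),
                              ("ctx", "device", "phone")]),
}

node_eq = lambda a, b: a["label"] == b["label"]
best = min(declarations,
           key=lambda name: nx.graph_edit_distance(sensed, declarations[name],
                                                   node_match=node_eq))
print(best)   # expected: 'museum-visit'
```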


Author(s):  
Eleni Christopoulou ◽  
John Garofalakis

Cultural heritage environments, such as museums, archaeological sites and cultural heritage cities, have gathered and preserved artefacts and relevant content for years. Today's state-of-the-art technology allows the shift from traditional exhibitions to ones with reinforced interaction between the cultural heritage environment and the visitor. For example, mobile applications have proved to be suitable for supporting such new forms of interaction. Effective interaction exploits information from the cultural environment, from the visitor, and from the broader context in which their interaction occurs. The aim of this chapter is to present the value of context in applications designed for cultural heritage environments and to demonstrate an infrastructure that effectively exploits it.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Sven D. Schrinner ◽  
Rebecca Serra Mari ◽  
Jana Ebler ◽  
Mikko Rautiainen ◽  
Lancelot Seillier ◽  
...  

Abstract Resolving genomes at the haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. Polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WhatsHap polyphase, a novel two-stage approach that addresses these challenges by (i) clustering reads and (ii) threading the haplotypes through the clusters. Our method outperforms the state of the art in terms of phasing quality. Using a real tetraploid potato dataset, we demonstrate how to assemble local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open source tool WhatsHap.
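
A toy sketch of the first stage (read clustering) may help fix ideas: reads are compared at shared variant positions and greedily grouped when they agree above a threshold. The scoring, threshold and clustering strategy below are simplifications for illustration and not the WhatsHap polyphase algorithm itself.

```python
def agreement(read_a, read_b):
    """Fraction of shared variant positions where two reads carry the same
    allele; reads are dicts {variant_position: allele}. Returns None if the
    reads share no positions."""
    shared = read_a.keys() & read_b.keys()
    if not shared:
        return None
    return sum(read_a[p] == read_b[p] for p in shared) / len(shared)

def cluster_reads(reads, threshold=0.8):
    """Greedy clustering: attach each read to the first cluster whose
    representative agrees with it above the threshold."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            score = agreement(read, cluster[0])
            if score is not None and score >= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters

reads = [
    {1: "A", 2: "C", 3: "G"},      # haplotype 1
    {2: "C", 3: "G", 4: "T"},      # haplotype 1
    {1: "T", 2: "G", 3: "G"},      # haplotype 2
    {2: "G", 3: "G", 4: "A"},      # haplotype 2
]
print(len(cluster_reads(reads)))    # expected: 2 clusters
```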

