Boosting Graph Alignment Algorithms

Sequence alignment algorithms and database search methods use BLOSUM and PAM substitution matrices constructed from general proteins. These de facto standard matrices are not optimal for accurately aligning sequences of proteins whose amino acid composition is markedly biased. In this work, a new amino acid substitution matrix, Hubsm, is calculated for the disorder-rich and low-complexity-rich regions of Hub proteins, based on residue characteristics. The amino acid background frequencies and substitution scores obtained from Hubsm reveal residue substitution patterns that differ from those of commonly used scoring matrices. When Hub protein sequences are compared to detect homologs, the Hubsm matrix yields better results than the PAM and BLOSUM matrices. The Hubsm matrix can therefore be a better choice for database searches and for constructing more accurate sequence alignments of Hub proteins.
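To make the link between background frequencies and substitution scores concrete, the sketch below computes generic BLOSUM-style log-odds scores from a handful of ungapped alignment columns. It is not the Hubsm procedure, which the abstract does not spell out; the function name `log_odds_matrix`, the `scale` parameter, and the toy columns are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(columns, scale=2.0):
    """Return (scores, background) from ungapped alignment columns.

    Generic BLOSUM-style log-odds; NOT the Hubsm construction itself.
    scores[(a, b)] = round(scale * log2(observed / expected)) for a <= b.
    """
    # Count unordered residue pairs within each column.
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())

    # Observed pair frequencies q_ab.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency p_a = q_aa + (1/2) * sum over b != a of q_ab.
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2

    # Expected frequency is p_a^2 for identical pairs, 2 * p_a * p_b otherwise.
    scores = {}
    for (a, b), f in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(f / expected))
    return scores, dict(p)


# Hypothetical toy data: four two-sequence columns.
scores, background = log_odds_matrix(["AA", "AA", "SS", "AS"])
```

A positive score means the pair is seen more often than the background frequencies predict, so a protein family with skewed composition shifts both the background and the resulting scores.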

A Comprehensive Analysis of Sequence Alignment Algorithms for LongRead Sequencing

Current Bioinformatics ◽ 10.2174/1574893611666160115213144 ◽ 2016 ◽ Vol 11 (3) ◽ pp. 375-381

Author(s): Yu Zhang ◽ Jian Tai He ◽ Yangde Zhang ◽ Ke Zuo

Keyword(s): Sequence Alignment ◽ Comprehensive Analysis ◽ Alignment Algorithms

An Efficient Tool for Searching Maximal and Super Maximal Repeats in Large DNA/Protein Sequences via Induced-Enhanced Suffix Array

Recent Patents on Computer Science ◽ 10.2174/2213275911666181107095645 ◽ 2019 ◽ Vol 12 (2) ◽ pp. 128-134

Author(s): Sanjeev Kumar ◽ Suneeta Agarwal ◽ Ranvijay

Keyword(s): Protein Sequences ◽ Input Sequence ◽ Suffix Array ◽ Secondary Memory ◽ Time And Space ◽ Efficient Tool ◽ Frequency Distributions ◽ Alignment Algorithms ◽ Common Prefix ◽ Art Works

Background: DNA and protein sequences of an organism contain repeated structures of various types. These repeated structures play an important role in molecular biology because they are related to the genetic backgrounds of inherited diseases. They also serve as markers for DNA mapping and DNA fingerprinting. Efficient searching of maximal and super maximal repeats in DNA/protein sequences can lead to many other applications in genomics. Moreover, these repeats can be used to identify critical diseases by finding the similarity between frequency distributions of repeats in viruses and genomes (without using alignment algorithms). Objective: The study aims to develop an efficient tool for searching maximal and super maximal repeats in large DNA/protein sequences. Methods: The proposed tool uses a newly introduced data structure, the Induced Enhanced Suffix Array (IESA). IESA is an extension of the enhanced suffix array that uses an induced suffix array instead of a classical suffix array. IESA consists of an Induced Suffix Array (ISA) and an additional array, the Longest Common Prefix (LCP) array. The ISA is an array of all sorted suffixes of the input sequence, while the LCP array stores the lengths of the longest common prefixes between all pairs of consecutive suffixes in the induced suffix array. IESA is known to be efficient w.r.t. both time and space. It facilitates the use of secondary memory for constructing large suffix arrays. Results: An open-source standalone tool named MSR-IESA for searching maximal and super maximal repeats in DNA/protein sequences is provided at https://github.com/sanjeevalg/MSRIESA. Experimental results show that the proposed algorithm outperforms other state-of-the-art works w.r.t. both time and space. Conclusion: The proposed tool MSR-IESA is remarkably efficient for the analysis of DNA/protein sequences having maximal and super maximal repeats of any length. It can be used for identification of well-known diseases.
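The relationship between a suffix array and its LCP array can be illustrated with a small sketch. It is not the MSR-IESA implementation: the suffix array here is built by naive sorting rather than induced sorting, the LCP array uses Kasai's algorithm, and the helper names (`suffix_array`, `lcp_array`, `longest_repeat`) are hypothetical. It reports only the longest repeated substring, a simpler task than enumerating all maximal and super maximal repeats.

```python
def suffix_array(s):
    """Start positions of all suffixes of s in sorted order.

    Naive O(n^2 log n) sort, used here in place of induced sorting.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai's algorithm: lcp[i] is the length of the longest common
    prefix of the suffixes at sa[i - 1] and sa[i]; lcp[0] is 0."""
    n = len(s)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def longest_repeat(s):
    """Longest substring that occurs at least twice (may overlap)."""
    if not s:
        return ""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    k = max(range(len(s)), key=lambda i: lcp[i])
    return s[sa[k]:sa[k] + lcp[k]]


print(longest_repeat("banana"))  # prints "ana"
```

Because equal prefixes end up adjacent after sorting, every repeat shows up as a run of nonzero LCP values, which is why repeat finders scan the LCP array instead of comparing all pairs of positions.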

Predicting Immunogenicity Risk in Biopharmaceuticals

Symmetry ◽ 10.3390/sym13030388 ◽ 2021 ◽ Vol 13 (3) ◽ pp. 388

Author(s): Nikolet Doneva ◽ Irini Doytchinova ◽ Ivan Dimitrov

Keyword(s): Machine Learning Algorithms ◽ Computational Tools ◽ Binding Motifs ◽ B Cell Epitopes ◽ Crucial Step ◽ Histocompatibility Complex ◽ Alignment Algorithms ◽ Experimental Approaches ◽ Dynamics Simulations ◽ Allergenicity Prediction

The assessment of the immunogenicity of biopharmaceuticals is a crucial step in their development. Immunogenicity is related to the activation of adaptive immunity. The complexity of the immune system manifests through numerous different mechanisms, which allows the use of different approaches for predicting the immunogenicity of biopharmaceuticals. Direct experimental approaches are sometimes expensive and time consuming, or their results need to be confirmed. In such cases, computational methods for immunogenicity prediction are an appropriate complement in the process of drug design. In this review, we analyze the use of various in silico methods and approaches for immunogenicity prediction of biomolecules: sequence alignment algorithms, predicting subcellular localization, searching for major histocompatibility complex (MHC) binding motifs, predicting T and B cell epitopes based on machine learning algorithms, molecular docking, and molecular dynamics simulations. Computational tools for antigenicity and allergenicity prediction are also considered.

Unsupervised Graph Alignment with Wasserstein Distance Discriminator

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining ◽ 10.1145/3447548.3467332 ◽ 2021

Author(s): Ji Gao ◽ Xiao Huang ◽ Jundong Li

Keyword(s): Wasserstein Distance ◽ Graph Alignment

Accelerating Sequence-to-Graph Alignment on Heterogeneous Processors

10.1145/3472456.3472505 ◽ 2021

Author(s): Zonghao Feng ◽ Qiong Luo

Keyword(s): Heterogeneous Processors ◽ Graph Alignment

Faster short-read mapping with strobemer seeds in syncmer space

10.1101/2021.06.18.449070 ◽ 2021

Author(s): Kristoffer Sahlin

Keyword(s): High Speed ◽ Mapping Accuracy ◽ Original Sequence ◽ Short Read ◽ Short Read Mapping ◽ Alignment Algorithms ◽ Reverse Complement ◽ Short Read Aligner ◽ Candidate Regions ◽ Burrows Wheeler Transform

Short-read genome alignment is a fundamental computational step used in many bioinformatic analyses. It is therefore desirable to align such data as fast as possible. Most alignment algorithms consider a seed-and-extend approach. Several popular programs perform the seeding step based on the Burrows-Wheeler Transform with a low memory footprint, but they are relatively slow compared to more recent approaches that use a minimizer-based seeding-and-chaining strategy. Recently, syncmers and strobemers were proposed for sequence comparison. Both protocols were designed for improved conservation of matches between sequences under mutations. Syncmers is a thinning protocol proposed as an alternative to minimizers, while strobemers is a linking protocol for gapped sequences and was proposed as an alternative to k-mers. The main contribution in this work is a new seeding approach that combines syncmers and strobemers. We use a strobemer protocol (randstrobes) to link together syncmers (i.e., in syncmer-space) instead of over the original sequence. Our protocol allows us to create longer seeds while preserving mapping accuracy. A longer seed length reduces the number of candidate regions which allows faster mapping and alignment. We also contribute the insight that speed-wise, this protocol is particularly effective when syncmers are canonical. Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing the non-canonical randstrobes in reverse complement. We implement our idea in a proof-of-concept short-read aligner strobealign that aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2. Many implementation versions of, e.g., BWA, achieve high speed on specific hardware. Our contribution is algorithmic and requires no hardware architecture or system-specific instructions. Strobealign is available at https://github.com/ksahlin/StrobeAlign.
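The idea of linking syncmers rather than raw k-mers can be sketched in a few lines. This is a simplified illustration and not strobealign's implementation: it uses open syncmers with a CRC32 hash, a plain "minimize the hash of the concatenation" link rule, and no canonical (reverse-complement-aware) handling. The function names, the parameters `k`, `s`, `t`, `w_min`, `w_max`, and the example read are all assumptions.

```python
import zlib


def h(x: str) -> int:
    """Deterministic stand-in hash (CRC32); an assumption of this sketch."""
    return zlib.crc32(x.encode())


def open_syncmers(seq, k=5, s=2, t=0):
    """Return (position, k-mer) for every k-mer whose smallest s-mer
    (by hash, leftmost on ties) starts at offset t within the k-mer."""
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        hashes = [h(kmer[j:j + s]) for j in range(k - s + 1)]
        if hashes.index(min(hashes)) == t:
            out.append((i, kmer))
    return out


def randstrobes_in_syncmer_space(syncmers, w_min=1, w_max=3):
    """Link each syncmer to one downstream syncmer chosen from a window
    counted in syncmers, not in bases over the original sequence."""
    seeds = []
    for idx, (pos1, first) in enumerate(syncmers):
        window = syncmers[idx + w_min: idx + w_max + 1]
        if not window:
            break  # no downstream syncmer left to link to
        # Simplified link rule: minimize the hash of the concatenation.
        pos2, second = min(window, key=lambda c: h(first + c[1]))
        seeds.append((pos1, pos2, first + second))
    return seeds


read = "ACGTTGCATGGACCTAGGCATTACGGATCCA"  # hypothetical sequence
syncmers = open_syncmers(read)
seeds = randstrobes_in_syncmer_space(syncmers)
```

Each seed spans two syncmers, so it is twice as long as a single k-mer seed, and because the window is measured in syncmers the second strobe is always itself a sampled position.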

Node Score, Graph Alignment

Encyclopedia of Systems Biology ◽ 10.1007/978-1-4419-9863-7_996 ◽ 2013 ◽ pp. 1526-1527

Author(s): Michal Kolář

Keyword(s): Graph Alignment

Performance of Forced-Alignment Algorithms on Children's Speech

Journal of Speech Language and Hearing Research ◽ 10.1044/2020_jslhr-20-00268 ◽ 2021 ◽ pp. 1-10

Author(s): Tristan J. Mahr ◽ Visar Berisha ◽ Kan Kawabata ◽ Julie Liss ◽ Katherine C. Hustad

Keyword(s): Gold Standard ◽ Acoustic Measurement ◽ Manual Segmentation ◽ Speech Sample ◽ Older Children ◽ Adaptive Training ◽ Alignment Algorithms ◽ Child Speech ◽ Speech Recognition Engine ◽ Children's Speech

Purpose: Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers. Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals. Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy increased with age for fricative sounds across the aligners too. Conclusion: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors. Supplemental Material: https://doi.org/10.23641/asha.14167058
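The two evaluation measures named in the Method section can be written down directly. The sketch below assumes that automatic and manual phone intervals are already paired one to one as `(onset, offset)` tuples in seconds, and it reports the mean absolute onset difference; the abstract does not say whether the study used absolute or signed differences. The function name `evaluate_alignment` and the sample intervals are hypothetical.

```python
def evaluate_alignment(auto, manual):
    """Return (accuracy, mean absolute onset difference in seconds).

    auto, manual: equal-length lists of (onset, offset) tuples, where
    entry i of each list describes the same phone (an assumption).
    """
    if not auto or len(auto) != len(manual):
        raise ValueError("need equal-length, non-empty interval lists")
    hits = 0
    onset_diffs = []
    for (a_on, a_off), (m_on, m_off) in zip(auto, manual):
        midpoint = (m_on + m_off) / 2
        # A hit means the automatic interval covers the manual midpoint.
        if a_on <= midpoint <= a_off:
            hits += 1
        onset_diffs.append(abs(a_on - m_on))
    n = len(onset_diffs)
    return hits / n, sum(onset_diffs) / n


# Hypothetical intervals for three phones.
auto = [(0.00, 0.10), (0.10, 0.30), (0.50, 0.60)]
manual = [(0.02, 0.12), (0.12, 0.28), (0.30, 0.45)]
accuracy, mean_onset_diff = evaluate_alignment(auto, manual)
```

In this toy case the first two automatic intervals cover the manual midpoints and the third does not, so the accuracy is two out of three.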
