Generative probabilistic biological sequence models that account for mutational variability

2020 ◽  
Author(s):  
Eli N. Weinstein ◽  
Debora S. Marks

Abstract Large-scale sequencing has revealed extraordinary diversity among biological sequences, produced over the course of evolution and within the lifetime of individual organisms. Existing methods for building statistical models of sequences often pre-process the data using multiple sequence alignment, an unreliable approach for many genetic elements (antibodies, disordered proteins, etc.) that is subject to fundamental statistical pathologies. Here we introduce a structured emission distribution (the MuE distribution) that accounts for mutational variability (substitutions and indels) and use it to construct generative and predictive hierarchical Bayesian models (H-MuE models). Our framework enables the application of arbitrary continuous-space vector models (e.g. linear regression, factor models, image neural networks) to unaligned sequence data. Theoretically, we show that the MuE generalizes classic probabilistic alignment models. Empirically, we show that H-MuE models can infer latent representations and features for immune repertoires, predict functional unobserved members of disordered protein families, and forecast the future evolution of pathogens.
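As a purely illustrative aside (not the authors' implementation, and with all probabilities chosen arbitrarily), the sketch below shows the kind of mutational emission process that the MuE distribution formalizes probabilistically: generating an observed sequence from an ancestral one via substitutions, insertions and deletions.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = list("ACGT")

def sample_observed(ancestor, p_sub=0.05, p_del=0.02, p_ins=0.02):
    """Toy emission process: walk along an ancestral sequence and emit an
    observed sequence with random substitutions, deletions and single-base
    insertions. Illustrative only -- the MuE distribution defines such a
    process as a structured probability distribution, not an ad hoc sampler."""
    out = []
    for base in ancestor:
        if rng.random() < p_ins:       # insert a random base before this site
            out.append(str(rng.choice(ALPHABET)))
        if rng.random() < p_del:       # delete this ancestral site
            continue
        if rng.random() < p_sub:       # substitute with a different base
            out.append(str(rng.choice([b for b in ALPHABET if b != base])))
        else:
            out.append(base)           # copy the ancestral base unchanged
    return "".join(out)

print(sample_observed("ACGTACGTACGT"))
```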

Author(s):  
Jun Wang ◽  
Pu-Feng Du ◽  
Xin-Yu Xue ◽  
Guang-Ping Li ◽  
Yuan-Ke Zhou ◽  
...  

Abstract Summary Many efforts have been made to develop bioinformatics algorithms that predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, a software tool that allows users to intuitively visualize and analyze statistical features of all types of biological sequences, including DNA, RNA and protein. VisFeature also integrates sequence data retrieval, multiple sequence alignment and statistical feature generation functions. Availability and implementation VisFeature is a desktop application implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
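VisFeature itself is a JavaScript/Electron and R application; the Python snippet below is not VisFeature code but a generic illustration of one simple statistical feature such tools commonly compute and visualize, the k-mer composition of a sequence.

```python
from collections import Counter

def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies -- a typical sequence-derived statistical feature."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

print(kmer_frequencies("ACGTACGTGACCT"))
```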


Author(s):  
Ashesh Nandy

The exponential growth of biological sequence data repositories has generated an urgent need to store, retrieve and analyse the data efficiently and effectively, for which the standard practice of alignment-based procedures is not adequate owing to the high demands on computing resources and time. Graphical representation of sequences has become one of the most popular alignment-free strategies for analysing biological sequences, in which each basic unit of the sequence – the bases adenine, cytosine, guanine and thymine for DNA/RNA, and the 20 amino acids for proteins – is plotted on a multi-dimensional grid. The resulting curve in 2D and 3D space, and the implied graph in higher dimensions, provide a perception of the underlying information of the sequences through visual inspection; numerical analyses of the plots, in geometrical or matrix terms, provide a measure of comparison between sequences and thus enable the study of sequence hierarchies. The new approach has also enabled comparisons of DNA sequences over many thousands of bases and has provided new insights into the structure of the base composition of DNA sequences. In this article we briefly review the origins and applications of graphical representations and highlight future perspectives in this field.
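As a concrete illustration, the sketch below computes a 2D DNA walk. The axis assignment (A west, G east, C north, T south) is one commonly used convention, assumed here for illustration; specific graphical representations in the literature differ in their coordinate choices.

```python
# Minimal 2D "DNA walk": each base moves the cursor one unit along an axis,
# and the resulting curve is an alignment-free picture of the sequence.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}

def dna_walk(seq):
    x = y = 0
    path = [(0, 0)]
    for base in seq.upper():
        dx, dy = STEPS.get(base, (0, 0))   # ignore ambiguous bases such as N
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

print(dna_walk("ACGTTTGCA"))
```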


2013 ◽  
Vol 2013 ◽  
pp. 1-14 ◽  
Author(s):  
Jurate Daugelaite ◽  
Aisling O' Driscoll ◽  
Roy D. Sleator

Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability of MSA to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud-based MSA algorithms is introduced.
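A minimal sketch of the parallelisation pattern discussed here: independent alignment jobs fanned out over local worker processes. It assumes the Clustal Omega executable (clustalo) is installed and on the PATH, and the input file names are placeholders; a cloud deployment would distribute the same jobs over remote nodes instead of local cores.

```python
import subprocess
from multiprocessing import Pool

def align(fasta_path):
    """Run Clustal Omega on one input file; assumes `clustalo` is on the PATH."""
    out_path = fasta_path.replace(".fasta", ".aln.fasta")
    subprocess.run(["clustalo", "-i", fasta_path, "-o", out_path, "--force"],
                   check=True)
    return out_path

if __name__ == "__main__":
    inputs = ["family1.fasta", "family2.fasta", "family3.fasta"]  # placeholder inputs
    with Pool(processes=3) as pool:
        print(pool.map(align, inputs))
```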


2021 ◽  
Author(s):  
Kaitlin M. Carey ◽  
Robert Hubley ◽  
George T. Lesica ◽  
Daniel Olson ◽  
Jack W. Roddy ◽  
...  

Abstract Annotation of a biological sequence is usually performed by aligning that sequence to a database of known sequence elements. When that database contains elements that are highly similar to each other, the proper annotation may be ambiguous, because several entries in the database produce high-scoring alignments. Typical annotation methods assign a label based on the candidate annotation with the highest alignment score; this can overstate annotation certainty, mislabel boundaries, and fail to identify large-scale rearrangements or insertions within the annotated sequence. Here, we present a new software tool, PolyA, that adjudicates between competing alignment-based annotations by computing estimates of annotation confidence, identifying a trace with maximal confidence, and recursively splicing/stitching inserted elements. PolyA communicates annotation certainty, identifies large-scale rearrangements, and detects boundaries between neighboring elements.
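PolyA's confidence computation is its own (it jointly adjudicates candidates along the sequence); the sketch below only illustrates the general idea of converting a set of competing alignment scores into normalized confidence estimates, with the scale parameter and candidate labels as arbitrary assumptions.

```python
import math

def annotation_confidence(scores, lam=0.1):
    """Convert competing alignment scores into normalized confidence values.
    `lam` acts as a scale parameter for the scoring system; the value here is
    an arbitrary placeholder, not a calibrated one."""
    m = max(scores)
    weights = [math.exp(lam * (s - m)) for s in scores]  # subtract max for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical competing repeat-family annotations with near-tied scores:
candidates = {"AluYa5": 412, "AluYb8": 405, "L1MA4": 250}
for name, conf in zip(candidates, annotation_confidence(list(candidates.values()))):
    print(f"{name}: {conf:.3f}")
```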


Author(s):  
Taylor Reiter ◽  
Phillip T. Brooks ◽  
Luiz Irber ◽  
Shannon E.K. Joslin ◽  
Charles M. Reid ◽  
...  

Abstract As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field. Author Summary We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.
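Systems such as Snakemake and Nextflow are examples of the data-centric workflow tools described here (named as examples, not as the authors' prescription). The toy sketch below shows, in plain Python, the dependency bookkeeping such systems automate: a step is re-run only when its output is missing or older than its inputs; the file names are placeholders.

```python
import os

def needs_rerun(inputs, output):
    """Re-run a step if its output is missing or older than any of its inputs --
    the bookkeeping that data-centric workflow systems automate at scale."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)

# Placeholder file names for one hypothetical alignment step:
if needs_rerun(["reads.fastq", "reference.fa"], "aligned.bam"):
    print("run alignment step")
else:
    print("output up to date, skip")
```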


2019 ◽  
Vol 14 (4) ◽  
pp. 574-589
Author(s):  
Linyan Xue ◽  
Xiaoke Zhang ◽  
Fei Xie ◽  
Shuang Liu ◽  
Peng Lin

In bioinformatics applications, existing algorithms cannot directly and efficiently perform sequence pattern mining. Two fast and efficient pattern mining algorithms, for single biological sequences and for multiple sequences, are proposed in this paper. The concept of the basic pattern is introduced, and, on the basis of mining frequent basic patterns, frequent patterns are then mined by constructing prefix trees over the frequent basic patterns. The proposed algorithms thus achieve rapid mining of frequent patterns in biological sequences based on pattern prefix trees. In experiments, family sequence data from the Pfam protein database are used to verify the performance of the proposed algorithms. The results confirm that the proposed algorithms not only obtain mining results with biological significance but also improve the running-time efficiency of biological sequence pattern mining.
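The paper's algorithms are not reproduced here; the sketch below only illustrates the two ingredients named in the abstract, counting frequent fixed-length "basic patterns" across sequences and storing them in a prefix tree (trie), with toy sequences and thresholds as assumptions.

```python
from collections import Counter

def frequent_basic_patterns(seqs, length=3, min_support=2):
    """Count fixed-length substrings ("basic patterns") across sequences and
    keep those meeting the support threshold."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return {p: n for p, n in counts.items() if n >= min_support}

def build_prefix_tree(patterns):
    """Store frequent patterns in a nested-dict trie keyed by residue."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node["$"] = True  # marks the end of a stored pattern
    return root

seqs = ["MKVLAA", "MKVLSA", "AKVLAA"]  # toy protein fragments
frequent = frequent_basic_patterns(seqs)
print(frequent)
print(build_prefix_tree(frequent))
```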


Author(s):  
Margarita Pertseva ◽  
Beichen Gao ◽  
Daniel Neumeier ◽  
Alexander Yermanos ◽  
Sai T. Reddy

Adaptive immunity is mediated by B and T lymphocytes, which respectively express vast and diverse repertoires of B cell and T cell receptors and, in conjunction with peptide antigen presentation through major histocompatibility complexes (MHCs), can recognize and respond to pathogens and diseased cells. In recent years, advances in deep sequencing have led to a massive increase in the amount of adaptive immune receptor repertoire data; additionally, proteomics techniques have led to a wealth of data on peptide–MHC presentation. These large-scale data sets are now making it possible to train machine and deep learning models, which can be used to identify complex and high-dimensional patterns in immune repertoires. This article introduces adaptive immune repertoires and machine and deep learning methods for biological sequence data, and then summarizes the many applications in this field, which span from predicting the immunological status of a host to the antigen specificity of individual receptors and the engineering of immunotherapeutics. Expected final online publication date for the Annual Review of Chemical and Biomolecular Engineering, Volume 12 is June 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
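As a deliberately minimal example of preparing repertoire data for such models (an illustration, not the review's method), the sketch below one-hot encodes an amino-acid sequence such as a CDR3 region; the padding length and example sequence are assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq, max_len=20):
    """Encode an amino-acid sequence as a (max_len, 20) one-hot matrix,
    padded or truncated to max_len -- a common input format for repertoire models."""
    mat = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(seq[:max_len]):
        mat[i, AA_INDEX[aa]] = 1.0
    return mat

cdr3 = "CASSLGTDTQYF"        # hypothetical TCR-beta CDR3 sequence
print(one_hot(cdr3).shape)   # (20, 20)
```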


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0239883
Author(s):  
Reece K. Hart ◽  
Andreas Prlić

Motivation Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is non-consequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility. Results Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available. It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u as identifiers, thereby seamlessly integrating public and private sequence sets. Availability SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.
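A short sketch of the sha512t24u convention as described: a SHA-512 digest encoded with URL-safe Base64. The truncation to 24 bytes is inferred from the convention's name and should be checked against the paper's definition.

```python
import base64
import hashlib

def sha512t24u(sequence: str) -> str:
    """SHA-512 digest, truncated to 24 bytes and URL-safe Base64 encoded,
    yielding a compact 32-character sequence identifier."""
    digest = hashlib.sha512(sequence.encode("ascii")).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")

print(sha512t24u("ACGT"))  # deterministic identifier for this toy sequence
```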


2020 ◽  
Author(s):  
Jordan Douglas ◽  
Rong Zhang ◽  
Remco Bouckaert

Abstract Uncorrelated relaxed clock models enable estimation of molecular substitution rates across lineages and are widely used in phylogenetics for dating evolutionary divergence times. In this article we delved into the internal complexities of the relaxed clock model in order to develop efficient MCMC operators for Bayesian phylogenetic inference. We compared three substitution rate parameterisations, introduced an adaptive operator which learns the weights of other operators during MCMC, and explored how relaxed clock model estimation can benefit from two cutting-edge proposal kernels: the AVMVN and Bactrian kernels. This work has produced an operator scheme that is up to 65 times more efficient at exploring continuous relaxed clock parameters compared with previous setups, depending on the dataset. Finally, we explored variants of the standard narrow exchange operator which are specifically designed for the relaxed clock model. In the most extreme case, this new operator traversed tree space 40% more efficiently than narrow exchange. The methodologies introduced are adaptive and highly effective on short as well as long alignments. The results are available via the open source optimised relaxed clock (ORC) package for BEAST 2 under a GNU licence (https://github.com/jordandouglas/ORC). Author Summary Biological sequences, such as DNA, accumulate mutations over generations. By comparing such sequences in a phylogenetic framework, the evolutionary tree of lifeforms can be inferred. With the overwhelming availability of biological sequence data, and the increasing affordability of collecting new data, the development of fast and efficient phylogenetic algorithms is more important than ever. In this article we focus on the relaxed clock model, which is very popular in phylogenetics. We explored how a range of optimisations can improve the statistical inference of the relaxed clock. This work has produced a phylogenetic setup which can infer parameters related to the relaxed clock up to 65 times faster than previous setups, depending on the dataset. The methods introduced adapt to the dataset during computation and are highly efficient when processing long biological sequences.
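For illustration, the sketch below draws a random-walk proposal from a Bactrian kernel, an equal mixture of two offset normal distributions whose bimodal shape discourages very small steps. It is a generic sketch rather than the BEAST 2/ORC operator code, and the modality parameter m = 0.95 is a commonly used default assumed here.

```python
import numpy as np

rng = np.random.default_rng(1)

def bactrian_step(x, scale=0.1, m=0.95):
    """Propose x' = x + scale * b, where b is drawn from a Bactrian
    distribution: an equal mixture of N(+m, 1 - m^2) and N(-m, 1 - m^2).
    Generic sketch only; not the BEAST 2 / ORC operator implementation."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    b = sign * m + np.sqrt(1.0 - m * m) * rng.standard_normal()
    return x + scale * b

rate = 0.01  # e.g. a branch-specific substitution rate being sampled
print(bactrian_step(rate))
```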


2016 ◽  
Vol 26 (04) ◽  
pp. 1750066 ◽  
Author(s):  
Lamiche Chaabane ◽  
Moussaoui Abdelouahab

One of the most essential operations in biological sequence analysis is multiple sequence alignment (MSA), which is used for constructing evolutionary trees from DNA sequences and for analyzing protein structures to help design new proteins. In this study, a new method for solving the multiple sequence alignment problem, named improved tabu search (ITS), is proposed. The algorithm is based on the classical tabu search (TS) optimization technique and is implemented to produce multiple sequence alignments. Several variants concerning neighbourhood generation and intensification/diversification strategies for the proposed ITS are investigated. Simulation results on a wide range of datasets show the efficacy of the developed approach and its capacity to achieve good-quality solutions, in terms of alignment scores, compared with those given by other existing methods.
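The ITS variants studied in the paper are not reproduced here; the sketch below is only the generic tabu search loop they build on, with a toy neighbourhood and scoring function standing in for alignment moves and alignment scores.

```python
from collections import deque

def tabu_search(initial, neighbours, score, iterations=100, tabu_size=10):
    """Generic tabu search: move to the best non-tabu neighbour at each step
    and remember recently visited solutions so the search does not cycle."""
    current = best = initial
    tabu = deque(maxlen=tabu_size)
    for _ in range(iterations):
        candidates = [n for n in neighbours(current) if n not in tabu]
        if not candidates:
            break
        current = max(candidates, key=score)
        tabu.append(current)
        if score(current) > score(best):
            best = current
    return best

# Toy stand-in problem: flip characters to match a target string.
TARGET = "ACCA"

def neighbours(s):
    return [s[:i] + c + s[i + 1:] for i in range(len(s)) for c in "AC" if c != s[i]]

def score(s):
    return sum(a == b for a, b in zip(s, TARGET))

print(tabu_search("CCCC", neighbours, score))  # converges to "ACCA"
```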

