biological sequences Latest Research Papers

Palindromic Vectors, Symmetropy and Symmentropy as Symmetry Descriptors of Binary Data

Entropy ◽

10.3390/e24010082 ◽

2022 ◽

Vol 24 (1) ◽

pp. 82

Author(s):

Jean-Marc Girault ◽

Sébastien Ménigot

Keyword(s):

Relative Error ◽

DNA Sequences ◽

Mirror Symmetry ◽

Binary Data ◽

Random Variable ◽

Point Of View ◽

Binary Sequences ◽

Biological Sequences ◽

Symmetry Properties ◽

Better Than

Today, the palindromic analysis of biological sequences, based exclusively on the study of “mirror” symmetry properties, is almost unavoidable. However, other types of symmetry, such as those present in friezes, could allow us to analyze binary sequences from another point of view. New tools, such as symmetropy and symmentropy, based on new types of palindromes allow us to discriminate binarized 1/f noise sequences better than Lempel–Ziv complexity. These new palindromes with new types of symmetry also allow for better discrimination of binarized DNA sequences. A relative error of 6% of symmetropy is obtained from the HUMHBB and YEAST1 DNA sequences. A factor of 4 between the slopes obtained from the linear fits of the local symmentropies for the two DNA sequences shows the discriminative capacity of the local symmentropy. Moreover, it is highlighted that a certain number of these new palindromes of sizes greater than 30 bits are more discriminating than those of smaller sizes assimilated to those from an independent and identically distributed random variable.
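
As a point of reference for the classical "mirror" symmetry mentioned above, the following is a minimal sketch that lists mirror palindromes in a binary string. It does not implement symmetropy, symmentropy, or the new palindrome types from the paper; the function name `mirror_palindromes` and the `min_len` parameter are hypothetical names chosen for illustration.

```python
def mirror_palindromes(bits: str, min_len: int = 4):
    """List (start, length) of every substring of length >= min_len
    that reads the same forwards and backwards (mirror symmetry only)."""
    found = []
    n = len(bits)
    for start in range(n):
        for length in range(min_len, n - start + 1):
            window = bits[start:start + length]
            if window == window[::-1]:  # mirror symmetry test
                found.append((start, length))
    return found


if __name__ == "__main__":
    # Brute-force scan of a short binary sequence.
    print(mirror_palindromes("0110100110", min_len=4))
```

The brute-force scan examines every substring, so it is only suitable for short sequences; it is meant to make the notion of a mirror palindrome concrete rather than to be efficient.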

Accelerating Smith-Waterman Algorithm for Faster Sequence Alignment using Graphical Processing Unit

Journal of Physics Conference Series ◽

10.1088/1742-6596/2161/1/012028 ◽

2022 ◽

Vol 2161 (1) ◽

pp. 012028

Author(s):

Karamjeet Kaur ◽

Sudeshna Chakraborty ◽

Manoj Kumar Gupta

Keyword(s):

Sequence Alignment ◽

Time Complexity ◽

Graphic Processing Unit ◽

Parallel Implementation ◽

Processing Unit ◽

Biological Sequences ◽

Sequential Approach ◽

Alignment Process ◽

Graphical Processing ◽

Sequential Implementation

In bioinformatics, sequence alignment is an important task for comparing biological sequences and finding similarity between them. The Smith-Waterman algorithm is the most widely used algorithm for the alignment process, but it has quadratic time complexity. Because the algorithm uses a sequential approach, aligning takes too much time as the number of biological sequences increases. In this paper, a parallel approach to the Smith-Waterman algorithm is proposed and implemented according to the architecture of the graphics processing unit (GPU) using CUDA, in which features of the GPU are combined with the CPU in such a way that the alignment process is three times faster than the sequential implementation of the Smith-Waterman algorithm. This paper describes the parallel implementation of sequence alignment using the GPU, and this intra-task parallelization strategy reduces the execution time. The results show significant runtime savings on the GPU.
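
To make the quadratic cost concrete, here is a minimal sequential Smith-Waterman sketch in Python. It is not the paper's CUDA implementation; the linear gap penalty and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions chosen for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment with a linear gap penalty.
    Returns (best_score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)

    # Fill phase: every cell is visited once, hence O(len(a) * len(b)) time.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTT", "GGACGGG"))
```

Each cell depends only on its upper, left, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That independence is what intra-task GPU parallelization schemes typically exploit.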

TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses

10.1101/2021.11.18.469186 ◽

2021 ◽

Author(s):

Kevin E. Wu ◽

Kathryn E. Yost ◽

Bence Daniel ◽

Julia A Belk ◽

Yu Xia ◽

...

Keyword(s):

T Cell ◽

T Cell Receptor ◽

T Cell Receptors ◽

State Of The Art ◽

Biological Significance ◽

Cell Receptor ◽

Antigen Binding ◽

Binding Affinities ◽

Biological Sequences ◽

Deep Learning Model

The T-cell receptor (TCR) allows T-cells to recognize and respond to antigens presented by infected and diseased cells. However, due to TCRs' staggering diversity and the complex binding dynamics underlying TCR antigen recognition, it is challenging to predict which antigens a given TCR may bind to. Here, we present TCR-BERT, a deep learning model that applies self-supervised transfer learning to this problem. TCR-BERT leverages unlabeled TCR sequences to learn a general, versatile representation of TCR sequences, enabling numerous downstream applications. We demonstrate that TCR-BERT can be used to build state-of-the-art TCR-antigen binding predictors with improved generalizability compared to prior methods. TCR-BERT simultaneously facilitates clustering sequences likely to share antigen specificities. It also facilitates computational approaches to challenging, unsolved problems such as designing novel TCR sequences with engineered binding affinities. Importantly, TCR-BERT enables all these advances by focusing on residues with known biological significance. TCR-BERT can be a useful tool for T-cell scientists, enabling greater understanding and more diverse applications, and provides a conceptual framework for leveraging unlabeled data to improve machine learning on biological sequences.

PURC v2.0: a program for improved sequence inference for polyploid phylogenetics and other manifestations of the multiple-copy problem

10.1101/2021.11.18.468666 ◽

2021 ◽

Author(s):

Peter W Schafran ◽

Fay-Wei W Li ◽

Carl Rothfels

Keyword(s):

Traditional Approach ◽

Consensus Sequence ◽

Operational Taxonomic Unit ◽

Phylogenetic Inference ◽

Biological Sequences ◽

Multiple Copy ◽

Sequencing Data ◽

Sequencing Errors ◽

Manual Curation ◽

Similarity Thresholds

Inferring the true biological sequences from amplicon mixtures remains a difficult bioinformatic problem. The traditional approach is to cluster sequencing reads by similarity thresholds and treat the consensus sequence of each cluster as an "operational taxonomic unit" (OTU). Recently, this approach has been improved upon by model-based methods that correct PCR and sequencing errors in order to infer "amplicon sequence variants" (ASVs). To date, ASV approaches have been used primarily in metagenomics, but they are also useful for identifying allelic or paralogous variants and for determining homeologs in polyploid organisms. To facilitate the usage of ASV methods among polyploidy researchers, we incorporated ASV inference alongside OTU clustering in PURC v2.0, a major update to PURC (Pipeline for Untangling Reticulate Complexes). In addition to preserving original PURC functions, PURC v2.0 allows users to process PacBio CCS/HiFi reads through DADA2 to generate and annotate ASVs for multiplexed data, with outputs including separate alignments for each locus ready for phylogenetic inference. In addition, PURC v2.0 features faster demultiplexing than the original version and has been updated to be compatible with Python 3. In this chapter we present results indicating that PURC v2.0 (using the ASV approach) is more likely to infer the correct biological sequences in comparison to the earlier OTU-based PURC, and describe how to prepare sequencing data, run PURC v2.0 under several different modes, and interpret the output. We expect that PURC v2.0 will provide biologists with a method for generating multi-locus "moderate data" datasets that are large enough to be phylogenetically informative and small enough for manual curation.
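
The traditional OTU approach described above (cluster reads by a similarity threshold, then take each cluster's consensus) can be sketched as follows. This is an illustration only, not PURC's or DADA2's implementation: it assumes equal-length, already aligned reads, uses position-wise identity as the similarity measure, and seeds clusters greedily. The names `identity`, `cluster_reads`, and `consensus` are hypothetical.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_reads(reads, threshold: float = 0.9):
    """Greedy clustering: a read joins the first cluster whose seed
    (first member) it matches at >= threshold, else it starts a new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if identity(read, cluster[0]) >= threshold:
                cluster.append(read)
                break
        else:  # no existing cluster was similar enough
            clusters.append([read])
    return clusters


def consensus(cluster) -> str:
    """Most common base in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGTAC",
             "TTTTGGGGCC", "TTTTGGGGCA", "TTTTGGGGCC"]
    for members in cluster_reads(reads, threshold=0.9):
        print(len(members), consensus(members))
```

The sketch also shows the weakness that motivates ASV methods: a true variant differing from the seed by less than the threshold is absorbed into the cluster and disappears from the consensus.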

Accelerating in-silico saturation mutagenesis using compressed sensing

10.1101/2021.11.08.467498 ◽

2021 ◽

Author(s):

Jacob Schreiber ◽

Surag Nair ◽

Akshay Balsubramani ◽

Anshul Kundaje

Keyword(s):

Compressed Sensing ◽

In Silico ◽

Computational Genomics ◽

Saturation Mutagenesis ◽

Sequence Length ◽

Biological Sequences ◽

Constant Number ◽

Order Of Magnitude ◽

Speed Up ◽

The Difference

In-silico saturation mutagenesis (ISM) is a popular approach in computational genomics for calculating feature attributions on biological sequences that proceeds by systematically perturbing each position in a sequence and recording the difference in model output. However, this method can be slow because systematically perturbing each position requires performing a number of forward passes proportional to the length of the sequence being examined. In this work, we propose a modification of ISM that leverages the principles of compressed sensing to require only a constant number of forward passes, regardless of sequence length, when applied to models that contain operations with a limited receptive field, such as convolutions. Our method, named Yuzu, can reduce the time that ISM spends in convolution operations by several orders of magnitude and, consequently, Yuzu can speed up ISM on several commonly used architectures in genomics by over an order of magnitude. Notably, we found that Yuzu provides speedups that increase with the complexity of the convolution operation and the length of the sequence being analyzed, suggesting that Yuzu provides large benefits in realistic settings. We have made this tool available at https://github.com/kundajelab/yuzu.
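
The baseline procedure described above (perturb each position, record the change in model output) can be sketched as follows. This is the naive ISM that the paper sets out to accelerate, not Yuzu itself. The `toy_model` function is a hypothetical stand-in for a trained network, and `naive_ism` is an illustrative name.

```python
ALPHABET = "ACGT"


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained model:
    counts occurrences of the motif 'TATA'."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))


def naive_ism(seq: str, model):
    """Return {(position, substituted_base): model(mutant) - model(reference)}.
    Requires one model call per substitution, i.e. 3 * len(seq) calls
    plus one for the reference sequence."""
    reference_output = model(seq)
    deltas = {}
    for pos, original in enumerate(seq):
        for base in ALPHABET:
            if base == original:
                continue  # skip the unmutated base
            mutant = seq[:pos] + base + seq[pos + 1:]
            deltas[(pos, base)] = model(mutant) - reference_output
    return deltas


if __name__ == "__main__":
    scores = naive_ism("GGTATAGG", toy_model)
    # Positions inside the motif show a drop; flanking positions do not.
    for (pos, base), delta in sorted(scores.items()):
        if delta != 0:
            print(pos, base, delta)
```

The number of model calls grows linearly with sequence length, which is the cost the abstract describes.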

Block aligner: fast and flexible pairwise sequence alignment with SIMD-accelerated adaptive blocks

10.1101/2021.11.08.467651 ◽

2021 ◽

Author(s):

Daniel Liu ◽

Martin Steinegger

Keyword(s):

Protein Sequences ◽

Single Instruction Multiple Data ◽

Alignment Algorithm ◽

Biological Sequences ◽

Pairwise Sequence Alignment ◽

Popular Method ◽

Multiple Data ◽

Long Reads ◽

Speed Up ◽

Scoring Schemes

Background: The Smith-Waterman-Gotoh alignment algorithm is the most popular method for comparing biological sequences. Recently, Single Instruction Multiple Data methods have been used to speed up alignment. However, these algorithms have limitations, such as being optimized for specific scoring schemes, being unable to handle large gaps, or requiring quadratic time computation. Results: We propose a new algorithm called block aligner for aligning nucleotide and protein sequences. It greedily shifts and grows a block of computed scores to span large gaps within the aligned sequences. This greedy approach needs to compute only a fraction of the DP matrix. In exchange for these features, there is no guarantee that the computed scores are accurate compared to full DP. However, in our experiments, we show that block aligner performs accurately on various realistic datasets, and it is up to 9 times faster than the popular Farrar's algorithm for protein global alignments. Conclusions: Our algorithm has applications in computing global alignments and X-drop alignments on proteins and long reads. It is available as a Rust library at https://github.com/Daniel-Liu-c0deb0t/block-aligner.

fLPS 2.0: rapid annotation of compositionally-biased regions in biological sequences

PeerJ ◽

10.7717/peerj.12363 ◽

2021 ◽

Vol 9 ◽

pp. e12363

Author(s):

Paul M. Harrison

Keyword(s):

Dark Matter ◽

DNA Sequences ◽

Low Complexity ◽

Biological Sequences ◽

Link Type ◽

Physico Chemical ◽

Protein Universe ◽

Supplemental File ◽

Chemical Character

Compositionally-biased (CB) regions in biological sequences are enriched for a subset of sequence residue types. These can be shorter regions with a concentrated bias (i.e., those termed ‘low-complexity’), or longer regions that have a compositional skew. These regions comprise a prominent class of the uncharacterized ‘dark matter’ of the protein universe. Here, I report the latest version of the fLPS package for the annotation of CB regions, which includes added consideration of DNA sequences, to label the eight possible biased regions of DNA. In this version, the user is now able to restrict analysis to a specified subset of residue types, and also to filter for previously annotated domains to enable detection of discontinuous CB regions. A ‘thorough’ option has been added which enables the labelling of subtler biases, typically made from a skew for several residue types. In the output, protein CB regions are now labelled with bias classes reflecting the physico-chemical character of the biasing residues. The fLPS 2.0 package is available from: https://github.com/pmharrison/flps2 or in a Supplemental File of this paper.
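
To illustrate what "enriched for a subset of residue types" can mean in practice, here is a minimal sketch that flags fixed-size windows in which one residue is over-represented, using a binomial tail probability against a background frequency. It is not the fLPS algorithm: the fixed window size, the single-residue test, the background frequency, and the names `binomial_tail` and `biased_windows` are all assumptions made for illustration.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def biased_windows(seq: str, residue: str, background: float,
                   window: int = 10, p_max: float = 1e-3):
    """Return (start, count, p_value) for each window in which `residue`
    occurs at least as often as would be expected with probability
    <= p_max under the assumed background frequency."""
    hits = []
    for start in range(len(seq) - window + 1):
        count = seq[start:start + window].count(residue)
        p_value = binomial_tail(count, window, background)
        if p_value <= p_max:
            hits.append((start, count, p_value))
    return hits


if __name__ == "__main__":
    # Hypothetical protein fragment with a glutamine (Q) run.
    fragment = "AAAAAQQQQQQQQQQAAAAA"
    for start, count, p_value in biased_windows(fragment, "Q", background=0.05):
        print(start, count, p_value)
```

A looser `p_max` picks up windows that only partly overlap the biased tract, which is one reason practical tools merge overlapping hits into regions rather than reporting windows individually.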

Memes: A motif analysis environment in R using tools from the MEME Suite

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008991 ◽

2021 ◽

Vol 17 (9) ◽

pp. e1008991

Author(s):

Spencer L. Nystrom ◽

Daniel J. McKay

Keyword(s):

Data Access ◽

R Package ◽

Comprehensive Analysis ◽

Multidimensional Data ◽

Biological Sequences ◽

Bioconductor Package ◽

Motif Analysis ◽

Bioconductor Project ◽

Analysis Environment ◽

Selection Of

Identification of biopolymer motifs represents a key step in the analysis of biological sequences. The MEME Suite is a widely used toolkit for comprehensive analysis of biopolymer motifs; however, these tools are poorly integrated within popular analysis frameworks like the R/Bioconductor project, creating barriers to their use. Here we present memes, an R package that provides a seamless R interface to a selection of popular MEME Suite tools. memes provides a novel “data aware” interface to these tools, enabling rapid and complex discriminative motif analysis workflows. In addition to interfacing with popular MEME Suite tools, memes leverages existing R/Bioconductor data structures to store the multidimensional data returned by MEME Suite tools for rapid data access and manipulation. Finally, memes provides data visualization capabilities to facilitate communication of results. memes is available as a Bioconductor package at https://bioconductor.org/packages/memes, and the source code can be found at github.com/snystrom/memes.

A Virtual Machine Platform for Non-Computer Professionals for Using Deep Learning to Classify Biological Sequences of Metagenomic Data

Journal of Visualized Experiments ◽

10.3791/62250 ◽

2021 ◽

Author(s):

Zhencheng Fang ◽

Hongwei Zhou

Keyword(s):

Deep Learning ◽

Virtual Machine ◽

Metagenomic Data ◽

Biological Sequences ◽

Computer Professionals

Semiotic thoughts on biological sequence representations

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207324666210705112232 ◽

2021 ◽

Vol 24 ◽

Author(s):

Guillermo Restrepo

Keyword(s):

Research Area ◽

Biological Sequences ◽

Biological Sequence ◽

Mathematical Chemistry ◽

DNA And RNA

The deluge of biological sequences, ranging from those of proteins, DNA and RNA to genomes, has increased the number of models for their representation, which are further used to contrast those sequences. Here we present a brief bibliometric description of the research area devoted to the representation of biological sequences and highlight the semiotic reaches of this process. Finally, we argue that this research area needs further research in step with the evolution of mathematical chemistry, and that its drawbacks need to be overcome.
