biological sequences

Entropy ◽  
2022 ◽  
Vol 24 (1) ◽  
pp. 82
Author(s):  
Jean-Marc Girault ◽  
Sébastien Ménigot

Today, the palindromic analysis of biological sequences, based exclusively on the study of “mirror” symmetry properties, is almost unavoidable. However, other types of symmetry, such as those present in friezes, could allow us to analyze binary sequences from another point of view. New tools, such as symmetropy and symmentropy, based on new types of palindromes, allow us to discriminate binarized 1/f noise sequences better than Lempel–Ziv complexity. These new palindromes with new types of symmetry also allow for better discrimination of binarized DNA sequences. A relative error of 6% in symmetropy is obtained from the HUMHBB and YEAST1 DNA sequences. A factor of 4 between the slopes obtained from the linear fits of the local symmentropies for the two DNA sequences shows the discriminative capacity of the local symmentropy. Moreover, some of these new palindromes longer than 30 bits are more discriminating than shorter ones, which resemble those produced by an independent and identically distributed random variable.
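As a point of reference for the classical "mirror" symmetry mentioned above, the following minimal sketch lists the maximal mirror palindromes in a binary string by expanding around each center. It does not compute symmetropy or symmentropy and does not cover the frieze-type symmetries of the paper; the function name `maximal_mirror_palindromes` and the `min_len` parameter are hypothetical, chosen only for illustration.

```python
def maximal_mirror_palindromes(bits, min_len=4):
    """Return (start, length) of each maximal mirror palindrome in `bits`.

    `bits` is a string of '0' and '1'. One palindrome is reported per
    center (odd and even centers), if its length is at least `min_len`.
    """
    found = []
    n = len(bits)
    for center in range(2 * n - 1):
        # Even `center` values are single-symbol centers,
        # odd values sit between two neighbouring symbols.
        left = center // 2
        right = left + center % 2
        while left >= 0 and right < n and bits[left] == bits[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length >= min_len:
            found.append((left + 1, length))
    return found


if __name__ == "__main__":
    print(maximal_mirror_palindromes("0110"))
    print(maximal_mirror_palindromes("10101", min_len=3))
```

The expansion is quadratic in the worst case, which is acceptable for short illustrative sequences.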


2022 ◽  
Vol 2161 (1) ◽  
pp. 012028
Author(s):  
Karamjeet Kaur ◽  
Sudeshna Chakraborty ◽  
Manoj Kumar Gupta

In bioinformatics, sequence alignment is an important task for comparing biological sequences and finding similarities between them. The Smith-Waterman algorithm is the most widely used alignment method, but it has quadratic time complexity. Because the algorithm is sequential, alignment takes too much time as the number of biological sequences grows. This paper proposes a parallel version of the Smith-Waterman algorithm, implemented in CUDA for the architecture of the graphics processing unit (GPU). It combines the GPU with the CPU so that alignment is three times faster than the sequential implementation of the Smith-Waterman algorithm. The paper describes this parallel implementation of sequence alignment, whose intra-task parallelization strategy reduces execution time. The results show significant runtime savings on the GPU.
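To make the quadratic cost concrete, here is a minimal sketch of the sequential Smith-Waterman scoring recurrence. It is not the CUDA implementation described in the abstract; the function name `smith_waterman_score`, the scoring values, and the linear gap penalty are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings `a` and `b`.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so memory is O(len(b)) while
    time stays O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                           # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That independence is what intra-task parallel implementations generally exploit.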


2021 ◽  
Author(s):  
Kevin E. Wu ◽  
Kathryn E. Yost ◽  
Bence Daniel ◽  
Julia A Belk ◽  
Yu Xia ◽  
...  

The T-cell receptor (TCR) allows T-cells to recognize and respond to antigens presented by infected and diseased cells. However, due to TCRs' staggering diversity and the complex binding dynamics underlying TCR antigen recognition, it is challenging to predict which antigens a given TCR may bind to. Here, we present TCR-BERT, a deep learning model that applies self-supervised transfer learning to this problem. TCR-BERT leverages unlabeled TCR sequences to learn a general, versatile representation of TCR sequences, enabling numerous downstream applications. We demonstrate that TCR-BERT can be used to build state-of-the-art TCR-antigen binding predictors with improved generalizability compared to prior methods. TCR-BERT simultaneously facilitates clustering sequences likely to share antigen specificities. It also facilitates computational approaches to challenging, unsolved problems such as designing novel TCR sequences with engineered binding affinities. Importantly, TCR-BERT enables all these advances by focusing on residues with known biological significance. TCR-BERT can be a useful tool for T-cell scientists, enabling greater understanding and more diverse applications, and provides a conceptual framework for leveraging unlabeled data to improve machine learning on biological sequences.


2021 ◽  
Author(s):  
Peter W Schafran ◽  
Fay-Wei W Li ◽  
Carl Rothfels

Inferring the true biological sequences from amplicon mixtures remains a difficult bioinformatic problem. The traditional approach is to cluster sequencing reads by similarity thresholds and treat the consensus sequence of each cluster as an "operational taxonomic unit" (OTU). Recently, this approach has been improved upon by model-based methods that correct PCR and sequencing errors in order to infer "amplicon sequence variants" (ASVs). To date, ASV approaches have been used primarily in metagenomics, but they are also useful for identifying allelic or paralogous variants and for determining homeologs in polyploid organisms. To facilitate the usage of ASV methods among polyploidy researchers, we incorporated ASV inference alongside OTU clustering in PURC v2.0, a major update to PURC (Pipeline for Untangling Reticulate Complexes). In addition to preserving original PURC functions, PURC v2.0 allows users to process PacBio CCS/HiFi reads through DADA2 to generate and annotate ASVs for multiplexed data, with outputs including separate alignments for each locus ready for phylogenetic inference. In addition, PURC v2.0 features faster demultiplexing than the original version and has been updated to be compatible with Python 3. In this chapter we present results indicating that PURC v2.0 (using the ASV approach) is more likely to infer the correct biological sequences in comparison to the earlier OTU-based PURC, and describe how to prepare sequencing data, run PURC v2.0 under several different modes, and interpret the output. We expect that PURC v2.0 will provide biologists with a method for generating multi-locus "moderate data" datasets that are large enough to be phylogenetically informative and small enough for manual curation.


2021 ◽  
Author(s):  
Jacob Schreiber ◽  
Surag Nair ◽  
Akshay Balsubramani ◽  
Anshul Kundaje

In-silico saturation mutagenesis (ISM) is a popular approach in computational genomics for calculating feature attributions on biological sequences that proceeds by systematically perturbing each position in a sequence and recording the difference in model output. However, this method can be slow because systematically perturbing each position requires performing a number of forward passes proportional to the length of the sequence being examined. In this work, we propose a modification of ISM that leverages the principles of compressed sensing to require only a constant number of forward passes, regardless of sequence length, when applied to models that contain operations with a limited receptive field, such as convolutions. Our method, named Yuzu, can reduce the time that ISM spends in convolution operations by several orders of magnitude and, consequently, Yuzu can speed up ISM on several commonly used architectures in genomics by over an order of magnitude. Notably, we found that Yuzu provides speedups that increase with the complexity of the convolution operation and the length of the sequence being analyzed, suggesting that Yuzu provides large benefits in realistic settings. We have made this tool available at https://github.com/kundajelab/yuzu.
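The naive procedure described above, with one forward pass per substitution, can be sketched as follows. This is not Yuzu and not the authors' code; `toy_model`, `naive_ism`, and the GATA motif are hypothetical stand-ins chosen for illustration, and a real use would substitute a trained model.

```python
ALPHABET = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def naive_ism(model, seq, alphabet=ALPHABET):
    """Return, per position, the change in model output for each substitution.

    Performs one forward pass for the reference sequence plus one per
    (position, alternative letter) pair, so cost grows with len(seq).
    """
    reference = model(seq)
    deltas = []
    for pos, original in enumerate(seq):
        row = {}
        for letter in alphabet:
            if letter == original:
                row[letter] = 0.0  # no mutation, no change
            else:
                mutated = seq[:pos] + letter + seq[pos + 1:]
                row[letter] = model(mutated) - reference
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    for pos, row in enumerate(naive_ism(toy_model, "CCGATACC")):
        print(pos, row)
```

For a DNA sequence of length L this performs 1 + 3L forward passes, which is the length-proportional cost the abstract refers to.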


2021 ◽  
Author(s):  
Daniel Liu ◽  
Martin Steinegger

Background: The Smith-Waterman-Gotoh alignment algorithm is the most popular method for comparing biological sequences. Recently, Single Instruction Multiple Data methods have been used to speed up alignment. However, these algorithms have limitations, such as being optimized for specific scoring schemes, being unable to handle large gaps, or requiring quadratic-time computation. Results: We propose a new algorithm called block aligner for aligning nucleotide and protein sequences. It greedily shifts and grows a block of computed scores to span large gaps within the aligned sequences. This greedy approach can compute only a fraction of the DP matrix. In exchange for these features, there is no guarantee that the computed scores are accurate compared to full DP. However, in our experiments, we show that block aligner performs accurately on various realistic datasets, and it is up to 9 times faster than the popular Farrar's algorithm for protein global alignments. Conclusions: Our algorithm has applications in computing global alignments and X-drop alignments on proteins and long reads. It is available as a Rust library at https://github.com/Daniel-Liu-c0deb0t/block-aligner.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12363
Author(s):  
Paul M. Harrison

Compositionally-biased (CB) regions in biological sequences are enriched for a subset of sequence residue types. These can be shorter regions with a concentrated bias (i.e., those termed ‘low-complexity’), or longer regions that have a compositional skew. These regions comprise a prominent class of the uncharacterized ‘dark matter’ of the protein universe. Here, I report the latest version of the fLPS package for the annotation of CB regions, which includes added consideration of DNA sequences, to label the eight possible biased regions of DNA. In this version, the user is now able to restrict analysis to a specified subset of residue types, and also to filter for previously annotated domains to enable detection of discontinuous CB regions. A ‘thorough’ option has been added which enables the labelling of subtler biases, typically made from a skew for several residue types. In the output, protein CB regions are now labelled with bias classes reflecting the physico-chemical character of the biasing residues. The fLPS 2.0 package is available from: https://github.com/pmharrison/flps2 or in a Supplemental File of this paper.


2021 ◽  
Vol 17 (9) ◽  
pp. e1008991
Author(s):  
Spencer L. Nystrom ◽  
Daniel J. McKay

Identification of biopolymer motifs represents a key step in the analysis of biological sequences. The MEME Suite is a widely used toolkit for comprehensive analysis of biopolymer motifs; however, these tools are poorly integrated within popular analysis frameworks like the R/Bioconductor project, creating barriers to their use. Here we present memes, an R package that provides a seamless R interface to a selection of popular MEME Suite tools. memes provides a novel “data aware” interface to these tools, enabling rapid and complex discriminative motif analysis workflows. In addition to interfacing with popular MEME Suite tools, memes leverages existing R/Bioconductor data structures to store the multidimensional data returned by MEME Suite tools for rapid data access and manipulation. Finally, memes provides data visualization capabilities to facilitate communication of results. memes is available as a Bioconductor package at https://bioconductor.org/packages/memes, and the source code can be found at github.com/snystrom/memes.


Author(s):  
Guillermo Restrepo

The deluge of biological sequences, ranging from those of proteins, DNA and RNA to genomes, has increased the number of models for their representation, which are in turn used to contrast those sequences. Here we present a brief bibliometric description of the research area devoted to the representation of biological sequences and highlight the semiotic reach of this process. Finally, we argue that this research area needs further work in line with the evolution of mathematical chemistry, and that its drawbacks need to be overcome.

