SeqRepo: A system for managing local collections of biological sequences

Motivation: Access to biological sequence data, such as genome, transcript, or protein sequences, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility.

Results: Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available. It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u digests as identifiers, thereby seamlessly integrating public and private sequence sets.

Availability: SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.
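The digest convention described above can be sketched in a few lines of Python. This is a minimal illustration, not SeqRepo's own code: it assumes that sha512t24u means "SHA-512, truncated to the first 24 bytes, encoded with URL-safe Base64", as the name suggests. The function names and the birthday-bound helper are hypothetical additions for illustration only.

```python
import base64
import hashlib
import math


def sha512t24u(blob: bytes) -> str:
    """Return a 32-character URL-safe digest of `blob` (assumed convention)."""
    digest = hashlib.sha512(blob).digest()  # 64 raw bytes
    truncated = digest[:24]                 # keep the first 24 bytes (192 bits)
    # 24 bytes is a multiple of 3, so the Base64 text carries no "=" padding.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def collision_probability(n_objects: int, bits: int = 192) -> float:
    """Birthday-bound approximation: P ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_objects * (n_objects - 1) / 2 ** (bits + 1)
    return -math.expm1(-exponent)


if __name__ == "__main__":
    print(sha512t24u(b""))  # z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
    print(collision_probability(10**12))
```

Because 192 bits map to exactly 32 Base64 characters, the identifier stays short while the approximate collision probability remains negligible even for very large collections.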


SeqRepo: A system for managing local collections of biological sequences

10.1101/2020.09.16.299495 ◽

2020 ◽

Author(s):

Reece K. Hart ◽

Andreas Prlić

Keyword(s):

Programming Languages ◽

High Performance ◽

Sequence Data ◽

Random Access ◽

Biological Sequences ◽

Biological Sequence ◽

Public And Private ◽

Human Sequence ◽

Local Sequence ◽

Sequence Identifier


Generative probabilistic biological sequence models that account for mutational variability

10.1101/2020.07.31.231381 ◽

2020 ◽

Author(s):

Eli N. Weinstein ◽

Debora S. Marks

Keyword(s):

Large Scale ◽

Sequence Data ◽

Disordered Proteins ◽

Biological Sequences ◽

Biological Sequence ◽

Multiple Sequence ◽

Continuous Space ◽

Future Evolution ◽

Disordered Protein ◽

Latent Representations

Abstract: Large-scale sequencing has revealed extraordinary diversity among biological sequences, produced over the course of evolution and within the lifetime of individual organisms. Existing methods for building statistical models of sequences often pre-process the data using multiple sequence alignment, an unreliable approach for many genetic elements (antibodies, disordered proteins, etc.) that is subject to fundamental statistical pathologies. Here we introduce a structured emission distribution (the MuE distribution) that accounts for mutational variability (substitutions and indels) and use it to construct generative and predictive hierarchical Bayesian models (H-MuE models). Our framework enables the application of arbitrary continuous-space vector models (e.g. linear regression, factor models, image neural networks) to unaligned sequence data. Theoretically, we show that the MuE generalizes classic probabilistic alignment models. Empirically, we show that H-MuE models can infer latent representations and features for immune repertoires, predict functional unobserved members of disordered protein families, and forecast the future evolution of pathogens.


Mapping Biomolecular Sequences: Graphical Representations - their Origins, Applications and Future Prospects

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207324666210510164743 ◽

2021 ◽

Vol 24 ◽

Author(s):

Ashesh Nandy

Keyword(s):

Dna Sequences ◽

Graphical Representation ◽

Sequence Data ◽

Basic Unit ◽

Graphical Representations ◽

Biological Sequences ◽

Biological Sequence ◽

New Approach ◽

3D Space ◽

2D And 3D

The exponential growth in the depositories of biological sequence data has generated an urgent need to store, retrieve and analyse the data efficiently and effectively, for which the standard practice of using alignment procedures is not adequate due to the high demand on computing resources and time. Graphical representation of sequences has become one of the most popular alignment-free strategies to analyse biological sequences, where each basic unit of the sequences (the bases adenine, cytosine, guanine and thymine for DNA/RNA, and the 20 amino acids for proteins) is plotted on a multi-dimensional grid. The resulting curve in 2D and 3D space and the implied graph in higher dimensions provide a perception of the underlying information of the sequences through visual inspection; numerical analyses, in geometrical or matrix terms, of the plots provide a measure of comparison between sequences and thus enable study of sequence hierarchies. The new approach has also enabled comparisons of DNA sequences over many thousands of bases and provided new insights into the structure of the base compositions of DNA sequences. In this article we briefly review the origins and applications of graphical representations and highlight the future perspectives in this field.
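As a rough illustration of the 2D case described above, the sketch below plots a DNA sequence as a walk on a grid, one unit step per base, and derives a simple numerical descriptor from the curve. The axis assignment in `STEPS` and the function names are assumptions chosen for illustration; published schemes assign bases to directions in different ways.

```python
import math

# Assumed axis assignment for illustration; other schemes map bases differently.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def dna_walk(sequence: str) -> list[tuple[int, int]]:
    """Return the grid points visited when each base moves one unit step."""
    x, y = 0, 0
    points = [(x, y)]
    for base in sequence.upper():
        dx, dy = STEPS[base]  # raises KeyError on non-ACGT symbols
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def centroid_radius(points: list[tuple[int, int]]) -> float:
    """Distance from the origin to the mean of the visited points (origin excluded)."""
    visited = points[1:]
    if not visited:
        return 0.0
    mean_x = sum(p[0] for p in visited) / len(visited)
    mean_y = sum(p[1] for p in visited) / len(visited)
    return math.hypot(mean_x, mean_y)


if __name__ == "__main__":
    walk = dna_walk("ACGT")
    print(walk)
    print(centroid_radius(walk))
```

Two sequences can then be compared through descriptors such as the centroid distance, without aligning them first.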


A deep learning framework with hybrid encoding for protein coding regions prediction in biological sequences

10.1101/2020.11.07.372524 ◽

2020 ◽

Author(s):

Chao Wei ◽

Junying Zhang ◽

Xiguo Yuan ◽

Zongzhen He ◽

Guojun Liu

Keyword(s):

Deep Learning ◽

Noncoding Rna ◽

Order Information ◽

Biological Sequences ◽

Biological Sequence ◽

Coding Region ◽

Protein Coding ◽

Learning Framework ◽

Coding Regions ◽

Local Sequence

Abstract: Protein coding region prediction is a very important but overlooked subtask for tasks such as prediction of complete gene structure and of coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed them into a classifier for final prediction. However, encoding schemes directly influence the classifier's capability to capture coding features, and how to choose a proper encoding scheme remains uncertain. Recently, we proposed a protein coding region prediction method for transcript sequences based on a bidirectional recurrent neural network with non-overlapping kmers, and achieved considerable improvement over existing methods, but there is still much room to improve the performance. In fact, kmer features that count the occurrence frequency of trinucleotides only reflect the local sequence order information between the most contiguous nucleotides, and so lose almost all the global sequence order information. In view of this, we here present a deep learning framework with hybrid encoding for protein coding region prediction in biological sequences, which effectively exploits global sequence order information, non-overlapping kmer features and statistical dependencies among coding labels. Evaluated on genomic and transcript sequences, our proposed method significantly outperforms existing state-of-the-art methods.
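The point about kmer frequency features losing global order can be made concrete with a small sketch. The code below is not the authors' encoder; the function names are hypothetical, and it only shows how a sequence is split into non-overlapping trinucleotides and why frequency counts alone cannot distinguish two sequences whose kmers appear in a different order.

```python
from collections import Counter


def non_overlapping_kmers(sequence: str, k: int = 3) -> list[str]:
    """Split `sequence` into consecutive k-mers, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of each non-overlapping k-mer in `sequence`."""
    kmers = non_overlapping_kmers(sequence, k)
    total = len(kmers)
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in Counter(kmers).items()}


if __name__ == "__main__":
    first = "ATGAAATAA"   # ATG AAA TAA
    second = "AAAATGTAA"  # AAA ATG TAA
    print(non_overlapping_kmers(first))
    # Same frequencies, different order: the count features cannot tell them apart.
    print(kmer_frequencies(first) == kmer_frequencies(second))
```

A sequence model that consumes the ordered token list, rather than the frequency table, keeps the information that the counts discard.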


FASTAFS: file system virtualisation of random access compressed FASTA files

BMC Bioinformatics ◽

10.1186/s12859-021-04455-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Youri Hoogstrate ◽

Guido W. Jenster ◽

Harmen J. G. van de Werken

Keyword(s):

Programming Languages ◽

File System ◽

State Of The Art ◽

Sequence Data ◽

Random Access ◽

Direct Access ◽

File Format ◽

Fasta File ◽

User Friendly ◽

System Access

Background: The FASTA file format, used to store polymeric sequence data, has been a bioinformatics file standard for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and to provide random access. Multiple compressors have been developed to archive FASTA files, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible with FASTA files, resulting in limited software integration.

Results: We designed a Linux-based toolkit that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem by using Filesystem in Userspace (FUSE). This guarantees in-sync virtualised metadata files and offers fast random-access decompression using bit encodings plus Zstandard (zstd). The toolkit, FASTAFS, can track all its system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy. The file compression ratios were comparable but not superior to those of other state-of-the-art archival tools, despite the innovative random access feature implemented in FASTAFS.

Conclusions: FASTAFS is a user-friendly, easy-to-deploy, backwards-compatible, general-purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers format conversion without the need to rewrite code in different programming languages, while preserving compatibility.


Frequent Patterns Algorithm of Biological Sequences based on Pattern Prefix-tree

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2019.4.3607 ◽

2019 ◽

Vol 14 (4) ◽

pp. 574-589

Author(s):

Linyan Xue ◽

Xiaoke Zhang ◽

Fei Xie ◽

Shuang Liu ◽

Peng Lin

Keyword(s):

Pattern Mining ◽

Sequence Data ◽

Biological Significance ◽

Frequent Pattern ◽

Frequent Patterns ◽

Biological Sequences ◽

Biological Sequence ◽

Protein Database ◽

Sequence Pattern ◽

Multiple Sequences

In bioinformatics applications, existing algorithms cannot directly and efficiently implement sequence pattern mining. Two fast and efficient biological sequence pattern mining algorithms, for a single biological sequence and for multiple sequences, are proposed in this paper. The concept of the basic pattern is proposed, and on the basis of mining frequent basic patterns, frequent patterns are mined by constructing prefix trees for the frequent basic patterns. The proposed algorithms implement rapid mining of frequent patterns of biological sequences based on pattern prefix trees. In the experiments, family sequence data from the Pfam protein database are used to verify the performance of the proposed algorithms. The results confirm that the proposed algorithms not only obtain mining results with effective biological significance, but also improve the running time of biological sequence pattern mining.


Adaptive dating and fast proposals: revisiting the phylogenetic relaxed clock model

10.1101/2020.09.09.289124 ◽

2020 ◽

Author(s):

Jordan Douglas ◽

Rong Zhang ◽

Remco Bouckaert

Keyword(s):

Sequence Data ◽

Extreme Case ◽

Evolutionary Divergence ◽

Biological Sequences ◽

Biological Sequence ◽

Clock Model ◽

Phylogenetic Framework ◽

Bayesian Phylogenetic Inference ◽

Clock Models ◽

Relaxed Clock

Abstract: Uncorrelated relaxed clock models enable estimation of molecular substitution rates across lineages and are widely used in phylogenetics for dating evolutionary divergence times. In this article we delved into the internal complexities of the relaxed clock model in order to develop efficient MCMC operators for Bayesian phylogenetic inference. We compared three substitution rate parameterisations, introduced an adaptive operator which learns the weights of other operators during MCMC, and explored how relaxed clock model estimation can benefit from two cutting-edge proposal kernels: the AVMVN and Bactrian kernels. This work has produced an operator scheme that is up to 65 times more efficient at exploring continuous relaxed clock parameters compared with previous setups, depending on the dataset. Finally, we explored variants of the standard narrow exchange operator which are specifically designed for the relaxed clock model. In the most extreme case, this new operator traversed tree space 40% more efficiently than narrow exchange. The methodologies introduced are adaptive and highly effective on short as well as long alignments. The results are available via the open source optimised relaxed clock (ORC) package for BEAST 2 under a GNU licence (https://github.com/jordandouglas/ORC).

Author summary: Biological sequences, such as DNA, accumulate mutations over generations. By comparing such sequences in a phylogenetic framework, the evolutionary tree of lifeforms can be inferred. With the overwhelming availability of biological sequence data, and the increasing affordability of collecting new data, the development of fast and efficient phylogenetic algorithms is more important than ever. In this article we focus on the relaxed clock model, which is very popular in phylogenetics. We explored how a range of optimisations can improve the statistical inference of the relaxed clock. This work has produced a phylogenetic setup which can infer parameters related to the relaxed clock up to 65 times faster than previous setups, depending on the dataset. The methods introduced adapt to the dataset during computation and are highly efficient when processing long biological sequences.


High Performance Pattern Matching on Heterogeneous Platform

Journal of Integrative Bioinformatics ◽

10.1515/jib-2014-253 ◽

2014 ◽

Vol 11 (3) ◽

pp. 88-98 ◽

Cited By ~ 1

Author(s):

Shima Soroushnia ◽

Masoud Daneshtalab ◽

Juha Plosila ◽

Tapio Pahikkala ◽

Pasi Liljeberg

Keyword(s):

Pattern Matching ◽

High Performance ◽

Sequence Data ◽

Good Choice ◽

Data Sets ◽

Biological Sequence ◽

Computational Molecular Biology ◽

Heterogeneous Architectures ◽

Protein Sequence Data ◽

Gpu Architecture

Summary: Pattern discovery is one of the fundamental tasks in bioinformatics, and pattern recognition is a powerful technique for searching sequence patterns in biological sequence databases. Fast, high-performance algorithms are in high demand in many applications in bioinformatics and computational molecular biology, because the significant increase in the number of DNA and protein sequences expands the need for faster pattern matching algorithms. For this purpose, heterogeneous architectures can be a good choice due to their potential for high performance and energy efficiency. In this paper we present an efficient implementation, on a heterogeneous CPU/GPU architecture, of Aho-Corasick (AC), a well-known exact pattern matching algorithm with linear complexity, and of Parallel Failureless Aho-Corasick (PFAC), the massively parallelized version of the AC algorithm without failure transitions. We progressively redesigned the algorithms and data structures to fit the GPU architecture. Our results on different protein sequence data sets show that the new implementation runs 15 times faster than the original implementation of the PFAC algorithm.
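For readers unfamiliar with Aho-Corasick, the following is a minimal single-threaded Python sketch of the classic algorithm with failure transitions. It is not the authors' CPU/GPU implementation and does not show the failureless PFAC variant; the function names and table layout are illustrative assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for the Aho-Corasick automaton."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link for each state
    output = [[]]  # patterns recognised on reaching each state

    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        output[state].append(pattern)

    # Phase 2: breadth-first traversal sets failure links level by level.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            # Inherit matches that end at the longest proper suffix state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, output = build_automaton(patterns)
    state = 0
    hits = []
    for index, char in enumerate(text):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for pattern in output[state]:
            hits.append((index - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    print(search("ushers", ["he", "she", "his", "hers"]))
```

The text is scanned once regardless of how many patterns are searched, which is what makes the approach attractive for large sequence databases.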


Improved Device Distribution in High-Performance SiNx Resistive Random Access Memory via Arsenic Ion Implantation

Nanomaterials ◽

10.3390/nano11061401 ◽

2021 ◽

Vol 11 (6) ◽

pp. 1401

Author(s):

Te Jui Yen ◽

Albert Chin ◽

Vladimir Gritsenko

Keyword(s):

High Performance ◽

Conduction Mechanism ◽

Random Access ◽

Random Access Memory ◽

Resistive Random Access Memory ◽

Switching Behavior ◽

Access Memory ◽

Current Voltage ◽

Current Voltage Characteristics ◽

Limited Conduction

Large device variation is a fundamental challenge for resistive random access memory (RRAM) array circuits. Improved device-to-device distributions of set and reset voltages in a SiNx RRAM device are realized via arsenic ion (As+) implantation. In addition, the As+-implanted SiNx RRAM device exhibits a much tighter cycle-to-cycle distribution than the nonimplanted device. The As+-implanted SiNx device further exhibits excellent performance, showing high stability and a large 1.73 × 10^3 resistance window at 85 °C retention for 10^4 s, and a large 10^3 resistance window after 10^5 cycles of the pulsed endurance test. The current–voltage characteristics of the high- and low-resistance states were both analyzed as a space-charge-limited conduction mechanism. From the simulated defect distribution in the SiNx layer, a microscopic model was established, and the formation and rupture of defect-conductive paths were proposed to explain the resistance switching behavior. The high device performance can therefore be attributed to the sufficient defects created by As+ implantation, which lead to low forming and operation power.


Predicting Chromosome Flexibility from the Genomic Sequence Based on Deep Learning Neural Networks

Current Bioinformatics ◽

10.2174/1574893616666210827095829 ◽

2021 ◽

Vol 16 ◽

Author(s):

Jinghao Peng ◽

Jiajie Peng ◽

Haiyin Piao ◽

Zhang Luo ◽

Kelin Xia ◽

...

Keyword(s):

Deep Learning ◽

High Performance ◽

Genomic Sequence ◽

Sequence Data ◽

Function Analysis ◽

Double Helix ◽

Gm12878 Cell ◽

Genomic Sequence Analysis ◽

And Function ◽

Nuclear Processes

Background: The open and accessible regions of the chromosome are more likely to be bound by transcription factors, which are important for nuclear processes and biological functions. Studying changes in chromosome flexibility can help to discover and analyze disease markers and improve the efficiency of clinical diagnosis. Current methods for predicting chromosome flexibility based on Hi-C data include the flexibility-rigidity index (FRI) and the Gaussian network model (GNM), which have been proposed to characterize chromosome flexibility. However, these methods require chromosome structure data from 3D biological experiments, which are time-consuming and expensive. Objective: Generally, the folding and curling of the double helix sequence of DNA have a great impact on chromosome flexibility and function. Motivated by the success of genomic sequence analysis in biomolecular function analysis, we aim to propose a method to predict chromosome flexibility based only on genomic sequence data. Method: We propose a new method (named "DeepCFP") using deep learning models to predict chromosome flexibility based only on genomic sequence features. The model has been tested in the GM12878 cell line. Results: The maximum accuracy of our model has reached 91%. The performance of DeepCFP is close to that of FRI and GNM. Conclusion: DeepCFP can achieve high performance based only on genomic sequence.
