biological sequence
Recently Published Documents

TOTAL DOCUMENTS: 429 (five years: 93)
H-INDEX: 31 (five years: 4)

Information, 2022, Vol. 13 (1), pp. 27
Author(s): Diego Garat, Dina Wonsever

In order to provide open access to data of public interest, it is often necessary to perform several data curation processes. In some cases, such as biological databases, curation involves quality control to ensure reliable experimental support for biological sequence data. In others, such as medical records or judicial files, publication must not interfere with the right to privacy of the persons involved. There are also interventions in the published data aimed at generating metadata that enable better querying and navigation. In all cases, the curation process constitutes a bottleneck that slows down general access to the data, so automatic or semi-automatic curation processes are of great interest. In this paper, we present a solution aimed at the automatic curation of our National Jurisprudence Database, with special focus on the anonymization of personal information. The anonymization process aims to hide the names of the participants involved in a lawsuit without losing the meaning of the narrative of facts. To achieve this goal, we need not only to recognize person names but also to resolve co-references, so that all mentions of the same person receive the same label. Our corpus has significant variation in the spelling of person names, so it was clear from the beginning that pre-existing tools would not achieve good performance. The challenge was to find a good way of injecting specialized knowledge about the syntax of person names while taking advantage of the capabilities of pre-trained tools. We fine-tuned an NER analyzer and built a clustering algorithm to resolve co-references between named entities. We present our first results, which are promising for both tasks: we obtained a micro-averaged F1 score of 90.21% in the NER task (up from 39.99% before retraining the same analyzer on our corpus) and an ARI score of 95.95% in clustering for co-reference resolution.
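
To illustrate the kind of co-reference clustering described above, the following minimal sketch groups spelling variants of person-name mentions by string similarity and maps each group to a single anonymized label. It is not the authors' algorithm; the similarity measure, the 0.8 threshold, and the PERSON_i labels are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' implementation): group spelling
# variants of person-name mentions and map each group to one anonymized label.
# The 0.8 threshold and the normalization step are assumptions.
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_mentions(mentions, threshold=0.8):
    """Greedy single-pass clustering of person-name mentions."""
    clusters = []  # each cluster is a list of mention strings
    for mention in mentions:
        for cluster in clusters:
            if any(similarity(mention, seen) >= threshold for seen in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    # map every mention to a stable anonymized label
    return {m: f"PERSON_{i + 1}"
            for i, cluster in enumerate(clusters) for m in cluster}


mentions = ["Juan Pérez", "JUAN PEREZ", "María Gómez", "Maria Gomez"]
print(cluster_mentions(mentions))  # both spellings of each name share a label
```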


2021, Vol. 16 (1)
Author(s): Tim Anderson, Travis J. Wheeler

Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so increased adoption will be aided by the availability of a fast and flexible FM-index library. Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ∼2–4x faster than SeqAn3 for nucleotide search and ∼2–6x faster for amino acid search; it is also ∼4x faster with a similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high level (count or locate all instances of a query string) and a low level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
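
As a language-agnostic illustration of the backward-search operation that AWFM-index accelerates, here is a plain-Python FM-index count sketch. It is educational only and does not reflect the AWFM-index C API or its AVX2-optimized occurrence function.

```python
# Minimal FM-index backward-search sketch in plain Python (educational only;
# NOT the AWFM-index API, which is a C library with an AVX2-optimized
# occurrence function).

def build_fm_index(text: str):
    text += "$"  # sentinel, lexicographically smallest character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in the text strictly smaller than c
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[i][c] = number of occurrences of c in bwt[:i]
    occ = [dict.fromkeys(counts, 0)]
    for ch in bwt:
        row = dict(occ[-1])
        row[ch] += 1
        occ.append(row)
    return sa, bwt, c_table, occ


def count(pattern: str, c_table, occ) -> int:
    """Backward search: number of occurrences of pattern in the indexed text."""
    lo, hi = 0, len(occ) - 1
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[lo][ch]
        hi = c_table[ch] + occ[hi][ch]
        if lo >= hi:
            return 0
    return hi - lo


sa, bwt, c_table, occ = build_fm_index("GATTACATTAC")
print(count("TTAC", c_table, occ))  # 2
```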


Author(s): Rosario Gilmary, Murugesan G

Deoxyribonucleic acid (DNA) is the fundamental molecule that carries the genetic instructions of a living organism and governs the growth and functioning of all known living organisms. Current DNA sequencing equipment produces vast amounts of genomic data; nucleotide databases such as GenBank grow 2 to 3 times larger annually, and the increase in genomic data outstrips the increase in storage capacity. Such volumes of genomic data demand efficient storage, fast transfer, and good performance. Compression algorithms are therefore used to reduce the volume and cost of data storage. General-purpose compression approaches perform poorly on these sequences, so novel compression algorithms have been introduced to achieve better compression ratios. Performance is compared in terms of compression ratio (the size of the compressed file relative to the original) and compression/decompression time (the time taken to compress or decompress the sequence). In the proposed work, the input DNA sequence is compressed by reconstructing the sequence into varied formats: the input DNA sequence is subjected to bit reduction, and the binary output is converted to hexadecimal format followed by encoding. Thus, the compression ratio of the biological sequence is improved.
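
The bit-reduction step described above can be illustrated with a short sketch that packs each nucleotide into 2 bits and views the packed bytes in hexadecimal. This shows only the general idea; the exact reconstruction formats and the final encoding stage of the proposed method are not shown.

```python
# Sketch of the bit-reduction idea (not the paper's full pipeline): pack each
# nucleotide into 2 bits and view the packed bytes as hexadecimal. For exact
# decompression the original sequence length must be stored separately,
# because trailing padding bits are indistinguishable from 'A' (00).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(seq: str) -> bytes:
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODE[base]
    n_bits = 2 * len(seq)
    pad = (-n_bits) % 8          # pad to a whole number of bytes
    return (bits << pad).to_bytes((n_bits + pad) // 8, "big")


seq = "GATTACA"
packed = pack_2bit(seq)
print(packed.hex().upper())      # hexadecimal view: '8F10'
print(f"{len(packed)} bytes packed vs {len(seq)} bytes as plain text")
```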


2021, Vol. 118 (40), pp. e2025782118
Author(s): Wei-Chia Chen, Juannan Zhou, Jason M. Sheltzer, Justin B. Kinney, David M. McCandlish

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5′ splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.
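
As a toy illustration of interpolating between a maximum-entropy estimate and observed sample frequencies with a single hyperparameter, the following sketch uses simple pseudocount smoothing over k-mers. It is not the Bayesian field-theoretic estimator developed in the paper; the alphabet, k, and alpha values are illustrative assumptions.

```python
# Toy analogue (not the paper's Bayesian field-theoretic estimator) of
# interpolating between a maximum-entropy estimate and observed frequencies
# with a single hyperparameter: pseudocount smoothing over k-mer frequencies.
# alpha -> infinity gives the uniform (maximum-entropy) distribution;
# alpha -> 0 gives the raw empirical frequencies.
from collections import Counter
from itertools import product


def smoothed_estimate(sequences, alphabet="ACGT", k=2, alpha=1.0):
    kmers = [s[i:i + k] for s in sequences for i in range(len(s) - k + 1)]
    counts = Counter(kmers)
    support = ["".join(p) for p in product(alphabet, repeat=k)]
    n, m = len(kmers), len(support)
    return {w: (counts[w] + alpha) / (n + alpha * m) for w in support}


seqs = ["ACGTACGT", "ACGGACGT", "TTGTACGA"]
for alpha in (0.0, 1.0, 100.0):
    dist = smoothed_estimate(seqs, alpha=alpha)
    print(alpha, round(dist["AC"], 3), round(dist["GG"], 3))
```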


Author(s): Rabah Lebsir, Abdesslem Layeb, Tahi Fariza

This paper presents a strategy to tackle the Multiple Sequence Alignment (MSA) problem, one of the most important tasks in biological sequence analysis. Its role is to align sequences in their entirety in order to derive relationships and common characteristics among a set of protein or nucleotide sequences. The MSA problem has been proved to be NP-hard. The proposed strategy incorporates a new idea based on the well-known divide-and-conquer paradigm: sequences are clustered as a preliminary step to improve the final alignment, and this decomposition can be used as an optimization procedure with any MSA aligner to explore promising alignments of the search space. In the proposed solution, the clusters are aligned in a parallel and distributed way in order to benefit from parallel architectures. The strategy was tested on classical benchmarks such as BAliBASE, Sabre, Prefab4 and Oxm, and the experimental results show that it compares favorably with other aligners.
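
The decomposition idea can be sketched as follows: group sequences by k-mer (Jaccard) similarity and align each cluster in parallel. The clustering threshold and the placeholder aligner below are assumptions; a real pipeline would call an external MSA aligner for each cluster and then merge the sub-alignments, as the authors propose.

```python
# Sketch of the decomposition idea: group sequences by k-mer (Jaccard)
# similarity, then align each cluster in parallel. The threshold and the
# placeholder aligner are assumptions, not the authors' method.
from concurrent.futures import ProcessPoolExecutor


def kmer_set(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0


def cluster(seqs, k=3, threshold=0.3):
    clusters = []
    for s in seqs:
        ks = kmer_set(s, k)
        for c in clusters:
            if jaccard(ks, kmer_set(c[0], k)) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def align_cluster(cluster_seqs):
    # Placeholder stand-in for a real MSA aligner call.
    width = max(len(s) for s in cluster_seqs)
    return [s.ljust(width, "-") for s in cluster_seqs]


if __name__ == "__main__":
    seqs = ["ACGTGGTAC", "ACGTGCTAC", "TTTTCCGGA", "TTTTCCGGT"]
    with ProcessPoolExecutor() as pool:
        sub_alignments = list(pool.map(align_cluster, cluster(seqs)))
    print(sub_alignments)
```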


2021
Author(s): Samantha Petti, Sean R. Eddy

Statistical inference and machine learning methods are benchmarked on test data independent of the data used to train the method. Biological sequence families are highly non-independent because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
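
A simple baseline that achieves the same guarantee, though more crudely than the paper's independent-set-based algorithms, is to link any two sequences whose pairwise identity is at least p% and assign whole connected components to either the training or the test set. The identity measure and the greedy component assignment below are illustrative assumptions.

```python
# Simple baseline (cruder than the paper's independent-set-based algorithms):
# connect sequences whose pairwise identity is >= p%, then assign whole
# connected components to either the training or the test set, so no test
# sequence is >= p% identical to any training sequence. Because components
# move as a unit, the split fractions are approximate, and the test set can
# end up empty if one component dominates.
from difflib import SequenceMatcher


def percent_identity(a: str, b: str) -> float:
    # Rough stand-in for an alignment-based identity computation.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def split_by_components(seqs, p=25.0, test_fraction=0.3):
    n = len(seqs)
    parent = list(range(n))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if percent_identity(seqs[i], seqs[j]) >= p:
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    train, test = [], []
    for comp in sorted(components.values(), key=len, reverse=True):
        if len(train) < (1 - test_fraction) * n:
            train.extend(comp)
        else:
            test.extend(comp)
    return train, test


seqs = ["ACGTACGTAA", "ACGTACGTAC", "TTTGGGCCCA", "TTTGGGCCCT", "GACCATTGCA"]
train, test = split_by_components(seqs, p=60.0)
print("train:", train, "test:", test)  # indices of training and test sequences
```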


2021
Author(s): Hong-Liang Li, Yi-He Pang, Bin Liu

In order to uncover the meanings of the ‘book of life’, 155 different biological language models (BLMs) for DNA, RNA and protein sequence analysis are discussed in this study, which are able to extract the linguistic properties of the ‘book of life’. We also extend the BLMs into a system called BioSeq-BLM for automatically representing and analyzing sequence data. Experimental results show that the predictors generated by BioSeq-BLM achieve comparable or even clearly better performance than the existing state-of-the-art predictors published in the literature, indicating that BioSeq-BLM will provide new approaches for biological sequence analysis based on natural language processing technologies and contribute to the development of this important field. To help readers use BioSeq-BLM for their own experiments, the corresponding web server and stand-alone package have been established and released, and can be freely accessed at http://bliulab.net/BioSeq-BLM/.
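
As a minimal illustration of the "sequence as language" analogy underlying such biological language models, the sketch below tokenizes a DNA sequence into overlapping k-mer "words" and builds a fixed-length bag-of-words vector. This is not the BioSeq-BLM API; the k-mer size and alphabet are assumptions.

```python
# Minimal illustration of the "sequence as language" analogy: tokenize a DNA
# sequence into overlapping k-mer "words" and build a bag-of-words vector.
# Not the BioSeq-BLM API; it only sketches the kind of representation such
# biological language models build on.
from collections import Counter
from itertools import product


def tokenize(seq: str, k: int = 3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(seq: str, k: int = 3, alphabet: str = "ACGT"):
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(tokenize(seq, k))
    return [counts[w] for w in vocab]  # fixed-length feature vector


vec = bag_of_words("ACGTACGTGGT")
print(len(vec), sum(vec))  # 64-dimensional vector built from 9 k-mer tokens
```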


PLoS ONE, 2021, Vol. 16 (9), pp. e0257404
Author(s): Adam J. Kleinschmit, Elizabeth F. Ryder, Jacob L. Kerby, Barbara Murdoch, Sam Donovan, ...

As powerful computational tools and ‘big data’ transform the biological sciences, bioinformatics training is becoming necessary to prepare the next generation of life scientists. Furthermore, because the tools and resources employed in bioinformatics are constantly evolving, bioinformatics learning materials must be continuously improved. In addition, these learning materials need to move beyond today’s typical step-by-step guides to promote deeper conceptual understanding by students. One of the goals of the Network for Integrating Bioinformatics into Life Sciences Education (NIBSLE) is to create, curate, disseminate, and assess appropriate open-access bioinformatics learning resources. Here we describe the evolution, integration, and assessment of a learning resource that explores essential concepts of biological sequence similarity. Pre/post student assessment data from diverse life science courses show significant learning gains. These results indicate that the learning resource is a beneficial educational product for the integration of bioinformatics across curricula.


Author(s): Charles E. Grant, Timothy L. Bailey

XSTREME is a web-based tool for performing comprehensive motif discovery and analysis in DNA, RNA or protein sequences, as well as in sequences over user-defined alphabets. It is designed for both very large and very small datasets. XSTREME is similar to the MEME-ChIP tool, but expands upon its capabilities in several ways. Like MEME-ChIP, XSTREME performs two types of de novo motif discovery, and also performs motif enrichment analysis of the input sequences using databases of known motifs. Unlike MEME-ChIP, which ranks motifs based on their enrichment in the centers of the input sequences, XSTREME uses enrichment anywhere in the sequences for this purpose. Consequently, XSTREME is more appropriate for motif-based analysis of sequences regardless of how the motifs are distributed within the sequences. XSTREME uses the MEME and STREME algorithms for motif discovery, and the recently developed SEA algorithm for motif enrichment analysis. The interactive HTML output produced by XSTREME includes highly accurate motif significance estimates, plots of the positional distribution of each motif, and histograms of the number of motif matches in each sequence. XSTREME is easy to use via its web server at https://meme-suite.org, and is fully integrated with the widely used MEME Suite of sequence analysis tools, which can be freely downloaded at the same web site for non-commercial use.
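
A toy illustration of scoring motif enrichment anywhere in the sequences (rather than only near their centers) is given below: it counts sequences with at least one motif match in the input set versus shuffled controls. The real XSTREME pipeline relies on the MEME/STREME discovery algorithms and the SEA enrichment test; the regex motif and shuffling scheme here are assumptions.

```python
# Toy illustration (not XSTREME's SEA/STREME algorithms) of motif enrichment
# measured anywhere in each sequence: compare how many input sequences contain
# at least one motif match versus shuffled versions of the same sequences.
import random
import re


def has_match(seq: str, motif_regex: str) -> bool:
    return re.search(motif_regex, seq) is not None


def enrichment_ratio(seqs, motif_regex, seed=0):
    rng = random.Random(seed)
    background = ["".join(rng.sample(list(s), len(s))) for s in seqs]  # shuffled controls
    pos = sum(has_match(s, motif_regex) for s in seqs)
    neg = sum(has_match(s, motif_regex) for s in background)
    return (pos + 1) / (neg + 1)   # pseudocounts avoid division by zero


seqs = ["TTGACGTCATT", "AAGACGTCAGG", "CCCTTTAAAGG", "GGGACGTCATT"]
print(enrichment_ratio(seqs, "GACGTCA"))  # >1 suggests enrichment over shuffled
```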

