Sequence Search: Latest Research Papers

2FAST2Q: A general-purpose sequence search and counting program for FASTQ files

10.1101/2021.12.17.473121 ◽

2021 ◽

Author(s):

Afonso Bravo ◽

Athanasios Typas ◽

Jan-Willem Veening

Keyword(s):

Search Algorithm ◽

Amplicon Sequencing ◽

General Purpose ◽

Sequencing Data ◽

Sequence Search ◽

Executable File ◽

Downstream Analysis ◽

User Friendly ◽

Barcode Sequencing ◽

Current Operating

The increasingly widespread use of next-generation sequencing protocols has created a need for user-friendly tools that process raw data. Here, we present 2FAST2Q, a versatile and intuitive standalone program capable of extracting and counting feature occurrences in FASTQ files. 2FAST2Q can be used in any experimental setup that requires feature extraction from raw reads: within a single program it quickly handles mismatch alignments, nucleotide-wise Phred score filtering, custom read trimming, and sequence searching. Using published CRISPRi datasets in which Escherichia coli and Mycobacterium tuberculosis gene essentiality, as well as host-cell sensitivity towards SARS-CoV-2 infectivity, were tested, we demonstrate that 2FAST2Q efficiently reproduces the read counts per provided feature obtained with traditional pipelines. Moreover, we show how different FASTQ read filtering parameters impact downstream analysis, and suggest a default usage protocol. 2FAST2Q has a familiar user interface and uses a custom sequence mismatch search algorithm that takes advantage of the JIT-compiled runtime speeds of Python's numba module. It is thus easier to use and faster than currently available tools, efficiently processing large CRISPRi-Seq or random-barcode sequencing datasets on any up-to-date laptop. 2FAST2Q is available as an executable file for all current operating systems without installation and as a Python3 module on the PyPI repository (available at https://veeninglab.com/2fast2q). We expect that 2FAST2Q will be useful not only for people working in microbiology but also for other fields in which amplicon sequencing data are generated.
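To make the idea of mismatch-tolerant feature counting with Phred filtering concrete, here is a minimal sketch in plain Python. It is not 2FAST2Q's algorithm or API: the function names, the fixed read window (`start`, `length`), the Phred+33 quality encoding, and the rule that an ambiguous mismatch hit is discarded are all assumptions made for illustration.

```python
from collections import Counter


def mismatches_within(a, b, limit):
    """True if equal-length strings a and b differ at no more than `limit` positions."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        if x != y:
            diff += 1
            if diff > limit:
                return False
    return True


def passes_phred(quality, min_phred, offset=33):
    """True if every base has a Phred score >= min_phred (Phred+33 encoding assumed)."""
    return all(ord(ch) - offset >= min_phred for ch in quality)


def count_features(reads, features, start, length, max_mismatches=1, min_phred=30):
    """Count features found in the window [start, start + length) of each read.

    reads    : iterable of (sequence, quality) pairs
    features : dict mapping feature sequence -> feature name (hypothetical layout)
    """
    counts = Counter()
    for seq, qual in reads:
        window = seq[start:start + length]
        window_qual = qual[start:start + length]
        if len(window) < length or not passes_phred(window_qual, min_phred):
            continue  # read too short or below the quality threshold
        if window in features:  # exact match first
            counts[features[window]] += 1
            continue
        hits = [name for fseq, name in features.items()
                if mismatches_within(window, fseq, max_mismatches)]
        if len(hits) == 1:  # assumption: ambiguous hits are discarded
            counts[hits[0]] += 1
    return counts


features = {"ACGT": "geneA", "TTTT": "geneB"}
reads = [
    ("GGACGTCC", "IIIIIIII"),  # exact match to geneA
    ("GGACGACC", "IIIIIIII"),  # one mismatch to geneA
    ("GGTTTTCC", "II!!IIII"),  # low-quality bases inside the window
    ("GGCCCCCC", "IIIIIIII"),  # no feature within one mismatch
]
counts = count_features(reads, features, start=2, length=4)
```

In this toy input only the first two reads are counted, both for `geneA`. A production tool would replace the linear scan over features with something faster, which is where a JIT compiler such as numba could help.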


Finite field and group algorithms for orthogonal sequence search

Information and Control Systems ◽

10.31799/1684-8853-2021-4-2-17 ◽

2021 ◽

pp. 2-17

Author(s):

Nikolay Balonin ◽

Alexander Sergeev ◽

Olga Sinitshina

Keyword(s):

Finite Fields ◽

Practical Importance ◽

Video Data ◽

Polynomial Equations ◽

Ideal Object ◽

Domestic Literature ◽

Finite Dimensional ◽

Sequence Search ◽

Lack Of Information ◽

First Time

Introduction: Hadamard matrices consisting of elements 1 and –1 are an ideal object for a visual application of finite-dimensional mathematics operating with a finite number of addresses for –1 elements. Unlike conventional matrix algebra, the notation of abstract algebra methods has changed rapidly without becoming widely adopted, so the accumulated experience needs to be revised and systematized. Purpose: To describe the algorithms of finite fields and groups in a uniform notation in order to make the extensive knowledge needed for finding orthogonal and suborthogonal sequences easier to absorb. Results: Formulas are proposed for relatively little-known algorithms (or versions of them) developed by Scarpis, Singer, Szekeres, Goethals and Seidel, and Noboru Ito, as well as polynomial equations used to prove theorems about the existence of finite-dimensional solutions. This fills a significant gap in the information available both in the domestic literature (most of these issues are published here for the first time) and abroad. Practical relevance: Orthogonal sequences, and methods for finding them efficiently via the theory of finite fields and groups, are of direct practical importance for noise-immune coding, compression and masking of video data.
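As a small illustration of how finite-field arithmetic yields orthogonal ±1 sequences, the sketch below builds a Hadamard matrix with the classical Paley construction over GF(p) for a prime p ≡ 3 (mod 4) and checks that its rows are mutually orthogonal. The Paley construction is a swapped-in textbook example, not one of the algorithms the abstract lists, and the function names are hypothetical.

```python
import math


def legendre(a, p):
    """Quadratic character of a modulo an odd prime p: 0, 1 or -1 (Euler's criterion)."""
    a %= p
    if a == 0:
        return 0
    return 1 if pow(a, (p - 1) // 2, p) == 1 else -1


def paley_hadamard(p):
    """Hadamard matrix of order p + 1 for a prime p with p % 4 == 3 (Paley type I)."""
    is_prime = p >= 2 and all(p % d for d in range(2, math.isqrt(p) + 1))
    if not is_prime or p % 4 != 3:
        raise ValueError("p must be a prime congruent to 3 modulo 4")
    n = p + 1
    H = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or i == 0:
                H[i][j] = 1       # diagonal and first row
            elif j == 0:
                H[i][j] = -1      # first column below the diagonal
            else:
                H[i][j] = legendre(i - j, p)  # Jacobsthal block
    return H


def is_hadamard(H):
    """Check that H is square, has entries +1/-1, and satisfies H * H^T = n * I."""
    n = len(H)
    if any(len(row) != n or any(x not in (1, -1) for x in row) for row in H):
        return False
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(H[i], H[j]))
            if dot != (n if i == j else 0):
                return False
    return True


H8 = paley_hadamard(7)
ok = is_hadamard(H8)
```

Each row of `H8` is an 8-element ±1 sequence orthogonal to every other row, which is the property that makes such matrices useful for coding and masking.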


KEC: unique sequence search by K-mer exclusion

Bioinformatics ◽

10.1093/bioinformatics/btab196 ◽

2021 ◽

Author(s):

Pavel Beran ◽

Dagmar Stehlíková ◽

Stephen P Cohen ◽

Vladislav Čurn

Keyword(s):

Amino Acid ◽

Nucleic Acid ◽

Source Code ◽

Unique Sequence ◽

Supplementary Information ◽

Supplementary Data ◽

Laptop Computers ◽

Sequence Search ◽

Target Sequences ◽

Cross Reference

Summary: Searching for amino acid or nucleic acid sequences unique to one organism may be challenging, depending on the size of the available datasets. K-mer elimination by cross-reference (KEC) allows users to quickly and easily find unique sequences by providing target and non-target sequences. Due to its speed, it can be used for datasets of genomic size and can be run on desktop or laptop computers with modest specifications. Availability and implementation: KEC is freely available for non-commercial purposes. Source code and executable binary files compiled for Linux, Mac and Windows can be downloaded from https://github.com/berybox/KEC. Supplementary information: Supplementary data are available at Bioinformatics online.
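The general idea of excluding k-mers by cross-reference can be sketched as follows: collect every k-mer that occurs in the non-target sequences, then keep the stretches of the target whose k-mers are all absent from that set. This is an illustrative sketch only, not KEC's implementation; the function names and the way adjacent unique k-mers are merged into regions are assumptions.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_regions(target, non_targets, k):
    """Return maximal target substrings whose k-mers never occur in any non-target."""
    excluded = set()
    for seq in non_targets:
        excluded |= kmers(seq, k)

    regions = []
    run_start = None
    n_kmers = len(target) - k + 1
    for i in range(n_kmers):
        if target[i:i + k] not in excluded:
            if run_start is None:
                run_start = i  # a run of unique k-mers begins here
        elif run_start is not None:
            # the run ended at k-mer i - 1, which covers bases up to i - 1 + k
            regions.append(target[run_start:i - 1 + k])
            run_start = None
    if run_start is not None:
        regions.append(target[run_start:])
    return regions


target = "AAACCCGGGTTT"
non_targets = ["AAACCC", "GGGTTT"]
regions = unique_regions(target, non_targets, k=4)
```

Here the only 4-mers of the target missing from both non-targets are `CCCG`, `CCGG` and `CGGG`, so the single reported region is `CCCGGG`.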


Fungal GH25 muramidases: New family members with applications in animal nutrition and a crystal structure at 0.78Å resolution

PLoS ONE ◽

10.1371/journal.pone.0248190 ◽

2021 ◽

Vol 16 (3) ◽

pp. e0248190

Author(s):

Olga V. Moroz ◽

Elena Blagova ◽

Edward Taylor ◽

Johan P. Turkenburg ◽

Lars K. Skov ◽

...

Keyword(s):

Crystal Structure ◽

Cell Wall ◽

Glycoside Hydrolase ◽

Animal Feed ◽

Commercial Application ◽

Animal Nutrition ◽

Suitable Candidate ◽

New Family ◽

Sequence Search ◽

Bacterial Peptidoglycan

Muramidases/lysozymes hydrolyse the peptidoglycan component of the bacterial cell wall. They are found in many of the glycoside hydrolase (GH) families. Family GH25 contains muramidases/lysozymes known as CH-type lysozymes, as they were initially discovered in the Chalaropsis species of fungus. The characterized enzymes from GH25 exhibit both β-1,4-N-acetyl- and β-1,4-N,6-O-diacetylmuramidase activities, cleaving the β-1,4-glycosidic bond between N-acetylmuramic acid (NAM) and N-acetylglucosamine (NAG) moieties in the carbohydrate backbone of bacterial peptidoglycan. Here, a set of fungal GH25 muramidases was identified from a sequence search, cloned, expressed, and screened for the ability to digest bacterial peptidoglycan, with a view to commercial application in chicken feed. The screen identified the enzyme from Acremonium alcalophilum JCM 736 as a suitable candidate for this purpose, and its relevant biochemical and biophysical properties are described. We report the crystal structure of the A. alcalophilum enzyme at atomic, 0.78 Å resolution, together with that of its homologue from Trichobolus zukalii at 1.4 Å, and compare these with the structures of homologues. GH25 enzymes offer a new solution in animal feed applications, such as processing bacterial debris in the animal gut.


Research of unstructured data interpretation problems

Российский технологический журнал ◽

10.32362/2500-316x-2021-9-1-7-17 ◽

2021 ◽

Vol 9 (1) ◽

pp. 7-17

Author(s):

V. S. Tomashevskaya ◽

D. A. Yakovlev

Keyword(s):

Data Visualization ◽

Data Interpretation ◽

Variance Analysis ◽

Unstructured Data ◽

Common Features ◽

Sequence Search ◽

Literary Sources ◽

Search Data

The term "unstructured data" refers to data that is unordered and arbitrary in form. Even so, this type of information has a certain structure. Today there is a wide variety of data, and it therefore needs to be interpreted. Interpretation tasks include forecasting, classification, clustering, association, sequence search, data visualization, and variance analysis. The difficulty is that the data can differ not only in format but also in structure. One of the key tasks when working with unstructured data is to find and identify patterns in order to understand the data and develop fill-in templates. The paper analyzes the rules for formatting bibliographic references in order to identify common patterns. The concepts of structured and unstructured data are briefly discussed. Existing directions of work with unstructured data and methods of processing it, in particular the rules for formatting bibliographic lists of literature sources, are considered. These rules were used to form templates consisting of semantic groups, based on examples of the corresponding reference lists. A final comparison of the resulting templates revealed both the common features shared by all of them and the features that distinguish them.


An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using LSM Trees

10.1101/2021.02.05.429839 ◽

2021 ◽

Author(s):

Fatemeh Almodaresi ◽

Jamshed Khan ◽

Sergey Madaminov ◽

Prashant Pandey ◽

Michael Ferdman ◽

...

Keyword(s):

Large Scale ◽

Supplementary Information ◽

De Bruijn Graph ◽

Sequencing Data ◽

Construction Time ◽

Graph Representations ◽

Sequence Search ◽

General Search ◽

Colored De Bruijn Graph ◽

Search Index

Motivation: In the past few years, researchers have proposed numerous indexing schemes for searching large databases of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes (Mantis, VariMerge, and Bifrost) that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. Results: In this paper, we show how to build a scalable and updatable exact sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates. We demonstrate Mantis's scalability by constructing an index of ≈ 40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Mantis is more efficient in terms of index-construction time and memory, query time and memory, and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support the general search queries we require). Mantis indexes were about 2.5× smaller than Bifrost's indexes and about half as big as VariMerge's indexes. Availability: The updatable Mantis implementation is available at https://github.com/splatlab/mantis/tree/[email protected] Supplementary information: Supplementary data are available online.
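The Bentley-Saxe transformation (also called the logarithmic method) turns a static, build-once structure into one that supports insertions by keeping a set of static structures whose sizes are distinct powers of two and merging them like carries in binary addition. The sketch below shows the mechanism with sorted lists standing in for the static index; it is not the Mantis implementation, and the class and method names are hypothetical.

```python
import bisect


class LogarithmicIndex:
    """Insert-only membership index built from static sorted lists (Bentley-Saxe style)."""

    def __init__(self):
        # levels[i] is either None or a sorted list holding exactly 2**i items
        self.levels = []

    def insert(self, item):
        carry = [item]
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry  # free slot: store the rebuilt structure
                return
            # slot occupied: rebuild one static structure from both and carry upward
            carry = sorted(self.levels[i] + carry)
            self.levels[i] = None
            i += 1

    def __contains__(self, item):
        # a query must consult every non-empty level
        for level in self.levels:
            if level:
                pos = bisect.bisect_left(level, item)
                if pos < len(level) and level[pos] == item:
                    return True
        return False


index = LogarithmicIndex()
for kmer in ["ACGT", "TTGA", "CCAT", "GATC", "AAAA"]:
    index.insert(kmer)
level_sizes = [len(level) if level else 0 for level in index.levels]
```

After five insertions the occupied levels mirror the binary representation of 5, so `level_sizes` is `[1, 0, 4]`. Each item takes part in at most a logarithmic number of rebuilds, at the cost of queries having to check every non-empty level.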


HpGAN: Sequence Search With Generative Adversarial Networks

IEEE Transactions on Neural Networks and Learning Systems ◽

10.1109/tnnls.2021.3126944 ◽

2021 ◽

pp. 1-13

Author(s):

Mingxing Zhang ◽

Zhengchun Zhou ◽

Lanping Li ◽

Zilong Liu ◽

Meng Yang ◽

...

Keyword(s):

Generative Adversarial Networks ◽

Adversarial Networks ◽

Sequence Search


Spectrally-Constrained Sequence Search Based on Simulated Annealing Algorithm

Computer Science and Application ◽

10.12677/csa.2021.113055 ◽

2021 ◽

Vol 11 (03) ◽

pp. 543-548

Author(s):

露霜韩

Keyword(s):

Simulated Annealing ◽

Simulated Annealing Algorithm ◽

Sequence Search ◽

Annealing Algorithm


LISA: Learned Indexes for DNA Sequence Analysis

10.1101/2020.12.22.423964 ◽

2020 ◽

Author(s):

Darryl Ho ◽

Saurabh Kalikar ◽

Sanchit Misra ◽

Jialin Ding ◽

Vasimuddin Md ◽

...

Keyword(s):

Sequence Analysis ◽

DNA Sequence ◽

DNA Sequences ◽

State Of The Art ◽

The State ◽

Plant Genome ◽

Exact Match ◽

Sequence Search ◽

Genomics Tools ◽

Next Generation Sequencing (NGS)

Background: Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences. Results: In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. We focus on accelerating two of the most essential flavors of DNA sequence search: exact search and super-maximal exact match (SMEM) search. LISA builds on and extends FM-index, which is the state-of-the-art technique widely deployed in genomics tools. Experiments with human, animal, and plant genome datasets indicate that LISA achieves up to 2.2× and 13.3× speedups over the state-of-the-art FM-index based implementations for exact search and super-maximal exact match (SMEM) search, respectively.
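For readers unfamiliar with the baseline that LISA extends, the sketch below shows exact search with a plain FM-index: the pattern is consumed right to left, and each step narrows a range of suffix-array rows using the first-column offsets `C` and the occurrence counts `occ`. This is the textbook backward search only, not LISA's learned index. The naive suffix-array construction and the function names are simplifications chosen for illustration, and the text is assumed not to contain the `$` sentinel.

```python
def build_fm_index(text):
    """Build C and occ tables for text (naive construction, small inputs only)."""
    text += "$"  # sentinel, assumed absent from text and smaller than every symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the sentinel
    alphabet = sorted(set(text))

    # C[c] = number of characters in text that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ, len(bwt)


def count_occurrences(pattern, index):
    """Number of exact occurrences of a non-empty pattern, via backward search."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


fm = build_fm_index("ACGTACGTGA")
hits_acg = count_occurrences("ACG", fm)
hits_tt = count_occurrences("TT", fm)
```

With this text, `hits_acg` is 2 and `hits_tt` is 0. Each pattern character costs two `occ` lookups, and that per-character step is the part a learned index aims to speed up.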
