A path recorder algorithm for Multiple Longest Common Subsequences (MLCS) problems

Shiwei Wei; Yuping Wang; Yuanchao Yang; Sen Liu

doi:10.1093/bioinformatics/btaa134

Bioinformatics ◽

10.1093/bioinformatics/btaa134 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3035-3042

Author(s):

Shiwei Wei ◽

Yuping Wang ◽

Yuanchao Yang ◽

Sen Liu

Keyword(s):

Dna Sequences ◽

Directed Acyclic Graph ◽

Large Scale ◽

Search Time ◽

Longest Common Subsequence ◽

Supplementary Information ◽

Acyclic Graph ◽

Longest Common Subsequences ◽

Longest Path ◽

Longest Paths

Abstract Motivation Searching for the longest common subsequences of many sequences is called the Multiple Longest Common Subsequence (MLCS) problem, a fundamental and challenging problem in many fields of data mining. Existing algorithms are not applicable to problems with long and large-scale sequences because of their huge time and space consumption. To efficiently handle large-scale MLCS problems, a Path Recorder Directed Acyclic Graph (PRDAG) model and a novel Path Recorder Algorithm (PRA) are proposed. Results In PRDAG, we transform the MLCS problem into searching for the longest paths in a Directed Acyclic Graph (DAG), where each longest path in the DAG corresponds to an MLCS. To tackle the problem efficiently, we eliminate all redundant and repeated nodes during the construction of the DAG, and for each node we maintain only the longest paths from the source node to it and ignore all non-longest paths. As a result, the DAG becomes very small, and memory space and search time are greatly reduced. Empirical experiments have been performed on a standard benchmark set of both DNA sequences and protein sequences. The experimental results demonstrate that our model and algorithm outperform the related leading algorithms, especially for large-scale MLCS problems. Availability and implementation The program code was written by the first author and is available at https://www.ncbi.nlm.nih.gov/nuccore and https://blog.csdn.net/wswguilin. Supplementary information Supplementary data are available at Bioinformatics online.
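The reduction described in this abstract, from MLCS to a longest-path search in a DAG, can be illustrated with a small sketch. This is not the authors' PRA: it is a plain memoized longest-path search over match points, without the node elimination or path recording the paper proposes, and it is only practical for short inputs. The function name `mlcs` and its return shape are assumptions made for illustration.

```python
from functools import lru_cache


def mlcs(seqs):
    """Return (length, sorted list of all longest common subsequences).

    Nodes of the implicit DAG are tuples of positions, one per sequence,
    that all hold the same character. The source node is (-1, ..., -1).
    """
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))

    def successors(point):
        # For each character, take its earliest occurrence after `point`
        # in every sequence; skip it if any sequence has no such occurrence.
        for ch in alphabet:
            nxt = tuple(s.find(ch, p + 1) for s, p in zip(seqs, point))
            if -1 not in nxt:
                yield ch, nxt

    @lru_cache(maxsize=None)
    def longest_from(point):
        # Length of the longest path starting at `point`, plus the strings
        # spelled by all paths of that length.
        best_len, best = 0, {""}
        for ch, nxt in successors(point):
            sub_len, subs = longest_from(nxt)
            if sub_len + 1 > best_len:
                best_len, best = sub_len + 1, {ch + s for s in subs}
            elif sub_len + 1 == best_len:
                best |= {ch + s for s in subs}
        return best_len, frozenset(best)

    source = tuple(-1 for _ in seqs)
    length, subs = longest_from(source)
    return length, sorted(subs)


if __name__ == "__main__":
    print(mlcs(["ABCBDAB", "BDCABA"]))
    print(mlcs(["ABCBDAB", "BDCABA", "BCAB"]))
```

Memoizing on the position tuple means each node is expanded once, which is the same reason the DAG view helps in the first place; the paper's contribution lies in keeping that graph small for large inputs, which this sketch does not attempt.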

A fast and memory efficient MLCS algorithm by character merging for DNA sequences alignment

Bioinformatics ◽

10.1093/bioinformatics/btz725 ◽

2019 ◽

Author(s):

Sen Liu ◽

Yuping Wang ◽

Wuning Tong ◽

Shiwei Wei

Keyword(s):

Dna Sequences ◽

Directed Acyclic Graph ◽

Large Scale ◽

State Of The Art ◽

Longest Common Subsequence ◽

Sequence Length ◽

Acyclic Graph ◽

Character Sequences ◽

Common Subsequence ◽

Memory Efficient

Abstract Motivation The multiple longest common subsequence (MLCS) problem is to find all longest common subsequences of multiple character sequences. It appears in many fields such as data mining, DNA alignment, bioinformatics and text editing. As sequence length and the number of sequences increase, the existing dynamic programming algorithms and dominant point-based algorithms become ineffective and inefficient, especially for large-scale MLCS problems. Results In this paper, by considering the characteristic of DNA sequences that they contain many consecutively repeated characters, we first design a character merging scheme which merges the consecutively repeated characters in the sequences. As a result, it shortens the sequences considered and saves the space needed to store them. To further reduce the space and time costs, we construct a weighted directed acyclic graph which is much smaller than the widely used directed acyclic graph for MLCS problems. Based on these techniques, we propose a fast and memory efficient algorithm for MLCS problems. Finally, experiments are conducted and the proposed algorithm is compared with several state-of-the-art algorithms. The experimental results show that the proposed algorithm performs better than the compared state-of-the-art algorithms in both time and space costs. Availability and implementation https://www.ncbi.nlm.nih.gov/nuccore and https://github.com/liusen1006/MLCS.
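The character merging idea, collapsing each run of consecutively repeated characters into one entry, can be sketched as a run-length encoding. The abstract does not give the exact representation or how the run lengths feed into the weighted DAG, so the `(character, count)` pairs and the helper names `merge_runs` and `expand_runs` below are assumptions for illustration only.

```python
from itertools import groupby


def merge_runs(seq):
    """Collapse consecutive repeats into (character, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]


def expand_runs(runs):
    """Inverse of merge_runs: rebuild the original sequence."""
    return "".join(ch * n for ch, n in runs)


if __name__ == "__main__":
    dna = "AAACCGTTTT"
    merged = merge_runs(dna)
    print(merged)                      # run-length pairs for the sequence
    print(len(dna), "->", len(merged))  # original length vs. merged length
    print(expand_runs(merged) == dna)
```

The merged form is shorter whenever the input has repeated runs, which is the property the abstract relies on for DNA sequences.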

A Parallel Programming Pattern Based on Directed Acyclic Graph

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.303-306.2165 ◽

2013 ◽

Vol 303-306 ◽

pp. 2165-2169

Author(s):

Zheng Meng ◽

Ying Lin ◽

Yan Kang ◽

Qian Yu

Keyword(s):

Parallel Programming ◽

Computer Technology ◽

Directed Acyclic Graph ◽

Large Scale ◽

Algorithm Design ◽

Batch Processing ◽

Acyclic Graph ◽

Static Data ◽

Definition Of

With the development of computer technology, multi-core programming has become a hot topic. Based on the directed acyclic graph, this paper defines a number of executable operations and establishes a parallel programming pattern. Using vertices to represent tasks and edges to represent communication between vertices, this parallel programming pattern lets programmers easily identify the available concurrency and expose it for use in algorithm design. The proposed pattern can be used for large-scale static data batch processing in multi-core environments and can bring considerable convenience when dealing with complex issues.
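A minimal sketch of the general idea, tasks as vertices and dependencies as edges, with every task whose inputs are ready submitted for concurrent execution. It is not the pattern defined in the paper: the function `run_dag`, its argument shapes, and the use of Python's standard `graphlib` (Python 3.9 or later) with a thread pool are all assumptions chosen for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_workers=4):
    """Run callables in dependency order, in parallel where the DAG allows.

    tasks: name -> callable taking a dict {dependency_name: result}
    deps:  name -> set of names that must finish first
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results, pending = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose prerequisites have all completed.
            for name in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(name, ())}
                pending[pool.submit(tasks[name], inputs)] = name
            # Block until at least one running task finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                name = pending.pop(fut)
                results[name] = fut.result()
                sorter.done(name)  # may unlock further tasks
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda _: [1, 2, 3],
        "double": lambda r: [x * 2 for x in r["load"]],
        "total": lambda r: sum(r["load"]),
        "report": lambda r: (r["double"], r["total"]),
    }
    deps = {"double": {"load"}, "total": {"load"}, "report": {"double", "total"}}
    print(run_dag(tasks, deps)["report"])
```

Here `double` and `total` depend only on `load`, so they can run at the same time; the graph makes that concurrency explicit without the programmer scheduling it by hand.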

CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract Motivation Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10²–10⁴-fold reduction of storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/. Supplementary information Supplementary data are available at Bioinformatics online.
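To make the memory concern concrete, the sketch below builds a dense k-mer frequency vector for a DNA sequence. It is not CRAFT's method (CRAFT learns a compact embedding instead); the function name `kmer_frequencies`, the fixed `ACGT` alphabet, and the dense list output are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Dense k-mer frequency vector over the alphabet ACGT.

    The vector has 4**k entries, ordered lexicographically (AA, AC, AG, ...).
    Assumes `seq` contains only the characters A, C, G and T.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[km] / total for km in kmers]


if __name__ == "__main__":
    vec = kmer_frequencies("ACGTAC", 2)
    print(len(vec), vec[1])  # vector size and the frequency of "AC"
    # The dense vector grows as 4**k regardless of sequence length.
    for k in (4, 8, 12):
        print(k, 4 ** k)
```

Because the dimension is 4^k, a dense vector at k = 12 already has 16,777,216 entries per sequence, which is the storage pressure the abstract refers to.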

Recalculating the Length of the Longest Path in Perturbed Directed Acyclic Graph

IFAC-PapersOnLine ◽

10.1016/j.ifacol.2019.11.422 ◽

2019 ◽

Vol 52 (13) ◽

pp. 1560-1565 ◽

Cited By ~ 1

Author(s):

Golshan Madraki ◽

Robert P. Judd

Keyword(s):

Directed Acyclic Graph ◽

Acyclic Graph ◽

Longest Path

annonex2embl: automatic preparation of annotated DNA sequences for bulk submissions to ENA

Bioinformatics ◽

10.1093/bioinformatics/btaa209 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3841-3848

Author(s):

Michael Gruenstaeudl

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Sequence Data ◽

Complete Sequence ◽

Supplementary Information ◽

Biological Research ◽

Sequence Alignments ◽

Easy Integration ◽

Central Pillar ◽

Python Package

Abstract Motivation The submission of annotated sequence data to public sequence databases constitutes a central pillar in biological research. The surge of novel DNA sequences awaiting database submission due to the application of next-generation sequencing has increased the need for software tools that facilitate bulk submissions. This need has yet to be met with the concurrent development of tools to automate the preparatory work preceding such submissions. Results The author introduces annonex2embl, a Python package that automates the preparation of complete sequence flatfiles for large-scale sequence submissions to the European Nucleotide Archive. The tool enables the conversion of DNA sequence alignments that are co-supplied with sequence annotations and metadata to submission-ready flatfiles. Among other features, the software automatically accounts for length differences among the input sequences while maintaining correct annotations, automatically interlaces metadata to each record and displays a design suitable for easy integration into bioinformatic workflows. As proof of its utility, annonex2embl is employed in preparing a dataset of more than 1500 fungal DNA sequences for database submission. Availability and implementation annonex2embl is freely available via the Python package index at http://pypi.python.org/pypi/annonex2embl. Supplementary information Supplementary data are available at Bioinformatics online.

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract Motivation Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10²–10⁴-fold reduction of storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability and implementation CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information Supplementary data are available at Bioinformatics online.

DeepEventMine: end-to-end neural nested event extraction from biomedical texts

Bioinformatics ◽

10.1093/bioinformatics/btaa540 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4910-4917

Author(s):

Hai-Long Trieu ◽

Thy Thy Tran ◽

Khoa N A Duong ◽

Anh Nguyen ◽

Makoto Miwa ◽

...

Keyword(s):

Directed Acyclic Graph ◽

State Of The Art ◽

Event Extraction ◽

Supplementary Information ◽

Supplementary Data ◽

General Domain ◽

Acyclic Graph ◽

End To End ◽

Biomedical Texts ◽

Extraction Model

Abstract Motivation Recent neural approaches to event extraction from text mainly focus on flat events in the general domain, while there are fewer attempts to detect nested and overlapping events. These existing systems are built on given entities and they depend on external syntactic tools. Results We propose an end-to-end neural nested event extraction model named DeepEventMine that extracts multiple overlapping directed acyclic graph structures from a raw sentence. On top of the bidirectional encoder representations from transformers model, our model detects nested entities and triggers, roles, nested events and their modifications in an end-to-end manner without any syntactic tools. Our DeepEventMine model achieves new state-of-the-art performance on seven biomedical nested event extraction tasks. Even when gold entities are unavailable, our model can detect events from raw text with promising performance. Availability and implementation Our codes and models to reproduce the results are available at: https://github.com/aistairc/DeepEventMine. Supplementary information Supplementary data are available at Bioinformatics online.

FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

10.1101/380824 ◽

2018 ◽

Cited By ~ 4

Author(s):

Benjamin T. James ◽

Brian B. Luczak ◽

Hani Z. Girgis

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Linear Models ◽

Linear Time ◽

Pairwise Alignment ◽

Supplementary Information ◽

General Linear ◽

General Linear Models ◽

Alignment Free ◽

Identity Score

Abstract Motivation Pairwise alignment is a predominant algorithm in the field of bioinformatics. This algorithm is quadratic and therefore slow, especially on long sequences. Many applications utilize identity scores without the corresponding alignments. For these applications, we propose FASTCAR. It produces identity scores for pairs of DNA sequences using alignment-free methods and two self-supervised general linear models. Results For the first time, the new tool can predict the pair-wise identity score in linear time and space. On two large-scale sequence databases, FASTCAR provided the best compromise between sensitivity and precision while being faster than BLAST by 40% and faster than USEARCH by 6–10 times. Further, FASTCAR is capable of producing the pair-wise identity scores of long DNA sequences, such as millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any alignment-based tool. Availability FASTCAR is available at https://github.com/TulsaBioinformaticsToolsmith/FASTCAR and as the Supplementary Dataset. Supplementary information Supplementary data are available online.

Rapid, large-scale species discovery in hyperdiverse taxa using 1D MinION sequencing

BMC Biology ◽

10.1186/s12915-019-0706-9 ◽

2019 ◽

Vol 17 (1) ◽

Cited By ~ 17

Author(s):

Amrita Srivathsan ◽

Emily Hartop ◽

Jayanthi Puniamoorthy ◽

Wan Ting Lee ◽

Sujatha Narayanan Kutty ◽

...

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Low Cost ◽

Small Body ◽

Small Subset ◽

Similar Species ◽

Large Species ◽

Morphological Examination ◽

Species Discovery ◽

Short Period

Abstract Background More than 80% of all animal species remain unknown to science. Most of these species live in the tropics and belong to animal taxa that combine small body size with high specimen abundance and large species richness. For such clades, using morphology for species discovery is slow because large numbers of specimens must be sorted based on detailed microscopic investigations. Fortunately, species discovery could be greatly accelerated if DNA sequences could be used for sorting specimens to species. Morphological verification of such “molecular operational taxonomic units” (mOTUs) could then be based on dissection of a small subset of specimens. However, this approach requires cost-effective and low-tech DNA barcoding techniques because well-equipped, well-funded molecular laboratories are not readily available in many biodiverse countries. Results We here document how MinION sequencing can be used for large-scale species discovery in a specimen- and species-rich taxon like the hyperdiverse fly family Phoridae (Diptera). We sequenced 7059 specimens collected in a single Malaise trap in Kibale National Park, Uganda, over the short period of 8 weeks. We discovered > 650 species which exceeds the number of phorid species currently described for the entire Afrotropical region. The barcodes were obtained using an improved low-cost MinION pipeline that increased the barcoding capacity sevenfold from 500 to 3500 barcodes per flowcell. This was achieved by adopting 1D sequencing, resequencing weak amplicons on a used flowcell, and improving demultiplexing. Comparison with Illumina data revealed that the MinION barcodes were very accurate (99.99% accuracy, 0.46% Ns) and thus yielded very similar species units (match ratio 0.991). Morphological examination of 100 mOTUs also confirmed good congruence with morphology (93% of mOTUs; > 99% of specimens) and revealed that 90% of the putative species belong to the neglected, megadiverse genus Megaselia. 
We demonstrate for one Megaselia species how the molecular data can guide the description of a new species (Megaselia sepsioides sp. nov.). Conclusions We document that one field site in Africa can be home to an estimated 1000 species of phorids and speculate that the Afrotropical diversity could exceed 200,000 species. We furthermore conclude that low-cost MinION sequencers are very suitable for reliable, rapid, and large-scale species discovery in hyperdiverse taxa. MinION sequencing could quickly reveal the extent of the unknown diversity and is especially suitable for biodiverse countries with limited access to capital-intensive sequencing facilities.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

Dna Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Abstract Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation The source code and the pretrained and finetuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information Supplementary data are available at Bioinformatics online.
