assembly algorithm Latest Research Papers

The assembly of contiguous sequence from metagenomic samples presents a particular challenge, due to the presence of multiple species, often closely related, at varying levels of abundance. Capturing diversity within species, for example viral haplotypes or bacterial strain-level diversity, is even more challenging. We present MetaCortex, a metagenome assembler based on data structures from the Cortex de novo assembler. MetaCortex captures intra-species diversity by searching for signatures of local variation along assembled sequences in the underlying assembly graph and outputting these sequences in sequence graph format. MetaCortex also implements a novel assembly algorithm for representing intra-species diversity in standard linear format. We show that MetaCortex produces accurate assemblies with higher genome coverage and contiguity than other popular metagenomic assemblers on mock viral communities with high levels of strain-level diversity, and on simulated communities containing simulated strains. We also show that accuracy can be increased further by using the sequence graph produced by MetaCortex to create highly accurate single-contig sequences.


A generalizable data assembly algorithm for infectious disease outbreaks

JAMIA Open ◽

10.1093/jamiaopen/ooab058 ◽

2021 ◽

Vol 4 (3) ◽

Author(s):

Maimuna S Majumder ◽

Sherri Rose

Keyword(s):

Infectious Disease ◽

Information Sources ◽

Disease Outbreaks ◽

Implementation Process ◽

Validation Data ◽

Data Set ◽

Assembly Algorithm ◽

Infectious Disease Outbreaks ◽

Health Agencies ◽

Machine Readable

Abstract During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. This information is rarely machine-readable, thus creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across 3 outbreaks. After developing an algorithm with regular expressions, we automatically curated data from health agencies via 3 information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak, and an implementation process was presented for application to future outbreaks. When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all 3 outbreaks. Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.
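The regular-expression curation step described in this abstract can be illustrated with a minimal sketch. The pattern, the function name `extract_counts`, and the sample sentence are all hypothetical and are not taken from the paper; the authors' actual expressions and sources are not given in this listing.

```python
import re

# Hypothetical pattern (an assumption, not the paper's): matches phrases
# such as "12 new cases", "1,204 confirmed cases" or "37 deaths".
COUNT_PATTERN = re.compile(
    r"(?P<count>\d[\d,]*)\s+(?:new\s+|confirmed\s+)?(?P<kind>cases?|deaths?)",
    re.IGNORECASE,
)


def extract_counts(text):
    """Return the first case count and first death count found in `text`."""
    counts = {}
    for match in COUNT_PATTERN.finditer(text):
        is_case = match.group("kind").lower().startswith("case")
        kind = "cases" if is_case else "deaths"
        value = int(match.group("count").replace(",", ""))
        # Keep only the first mention of each kind.
        counts.setdefault(kind, value)
    return counts


# Invented sample sentence in the style of a health-agency update.
report = "As of 4 May, the ministry reported 1,204 confirmed cases and 37 deaths."
print(extract_counts(report))
```

A text with no matching phrase yields an empty dictionary, which is one simple way such a pipeline could flag missing values for manual review.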


What do Eulerian and Hamiltonian cycles have to do with genome assembly?

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008928 ◽

2021 ◽

Vol 17 (5) ◽

pp. e1008928

Author(s):

Paul Medvedev ◽

Mihai Pop

Keyword(s):

Genome Assembly ◽

Linear Time ◽

Hamiltonian Cycles ◽

De Bruijn Graphs ◽

Genome Reconstruction ◽

Assembly Algorithm ◽

A Genome ◽

De Bruijn ◽

Do So

Many students are taught about genome assembly using the dichotomy between the complexity of finding Eulerian and Hamiltonian cycles (easy versus hard, respectively). This dichotomy is sometimes used to motivate the use of de Bruijn graphs in practice. In this paper, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems. We give 2 arguments. The first is that a genome reconstruction is never unique and hence an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. The second is that even if an arbitrary genome reconstruction was desired, one could do so in linear time in both the Eulerian and Hamiltonian paradigms.
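For readers unfamiliar with the textbook construction this abstract refers to, the following is a minimal sketch of a de Bruijn graph built from reads and traversed with Hierholzer's algorithm for an Eulerian path. It illustrates the classroom picture only and, as the abstract argues, is not how practical assemblers work. The function names and the toy reads are hypothetical, and the traversal assumes an Eulerian path exists.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm, linear in the number of edges.

    Assumes the graph actually has an Eulerian path.
    """
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in graph if len(graph[n]) - in_deg[n] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Spell the sequence along a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy reads (invented) sampled from a 10-base sequence.
reads = ["ATGGCGT", "GCGTGCA"]
graph = de_bruijn_edges(reads, 3)
print(spell(eulerian_path(graph)))
```

This toy graph admits two Eulerian paths, spelling ATGGCGTGCA and ATGCGTGGCA, so even here the reconstruction is not unique, which is the first of the two arguments made in the abstract.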


3D DNA Self-Assembly Algorithmic Model to Solve the Hamiltonian Path Problem

Journal of Nanoelectronics and Optoelectronics ◽

10.1166/jno.2021.3000 ◽

2021 ◽

Vol 16 (5) ◽

pp. 731-737

Author(s):

Jingjing Ma

Keyword(s):

Self Assembly ◽

Dna Computing ◽

Computing Time ◽

Hamiltonian Path ◽

Experimental Methods ◽

Deterministic Algorithm ◽

Assembly Algorithm ◽

Hamiltonian Path Problem ◽

Dna Tiles ◽

Algorithmic Model

Self-assembly reveals the innate character of DNA computing, and DNA self-assembly is regarded as the best way to turn DNA computing into a computer chip. This paper introduces a 3D DNA self-assembly algorithm for solving the Hamiltonian Path Problem. First, I introduce a non-deterministic algorithm. Then, following that algorithm, I design the types of DNA tiles that the computing process needs. Lastly, I demonstrate the self-assembly process and the experimental methods that yield the final result. The computing time is linear, and the number of distinct tile types is constant.


A Generalizable Data Assembly Algorithm for Infectious Disease Outbreaks

10.1101/2021.04.21.21255862 ◽

2021 ◽

Author(s):

Maimuna S. Majumder ◽

Sherri Rose

Keyword(s):

Infectious Disease ◽

Information Sources ◽

Disease Outbreaks ◽

Data Sets ◽

Validation Data ◽

Data Set ◽

Assembly Algorithm ◽

Infectious Disease Outbreaks ◽

Health Agencies ◽

Machine Readable

Abstract Background & Objective During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. This information is usually text-based and rarely machine-readable, thus creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across three outbreaks. Methods After developing an algorithm with regular expressions, we automatically curated data from health agencies via three information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak. Findings When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all three outbreaks. Conclusions Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.


Deeplasmid: Deep learning accurately separates plasmids from bacterial chromosomes

10.1101/2021.03.11.434936 ◽

2021 ◽

Author(s):

William B Andreopoulos ◽

Alexander M Geller ◽

Miriam Lucke ◽

Jan Balewski ◽

Alicia Clum ◽

...

Keyword(s):

Deep Learning ◽

High Reliability ◽

Biological Data ◽

Yersinia Ruckeri ◽

Microbial Genomes ◽

Sequencing Platform ◽

Assembly Algorithm ◽

Antimicrobial Resistance Genes ◽

Bacterial Chromosomes ◽

Long Read

Abstract Plasmids are mobile genetic elements that play a key role in microbial ecology and evolution by mediating horizontal transfer of important genes, such as antimicrobial resistance genes. Many microbial genomes have been sequenced by short read sequencers and have resulted in a mix of contigs that derive from plasmids or chromosomes. New tools that accurately identify plasmids are needed to elucidate new plasmid-borne genes of high biological importance. We have developed Deeplasmid, a deep learning tool for distinguishing plasmids from bacterial chromosomes based on the DNA sequence and its encoded biological data. It requires as input only assembled sequences generated by any sequencing platform and assembly algorithm and its runtime scales linearly with the number of assembled sequences. Deeplasmid achieves an AUC-ROC of over 93%, and it was much more precise than the state-of-the-art methods. Finally, as a proof of concept, we used Deeplasmid to predict new plasmids in the fish pathogen Yersinia ruckeri ATCC 29473 that has no annotated plasmids. Deeplasmid predicted with high reliability that a long assembled contig is part of a plasmid. Using long read sequencing we indeed validated the existence of a 102 Kbp long plasmid, demonstrating Deeplasmid’s ability to detect novel plasmids. Availability The software is available with a BSD license: deeplasmid.sourceforge.io. A Docker container is available on DockerHub under: billandreo/[email protected]@mail.huji.ac.il


Gene Sequence Assembly Algorithm Model Based on the DBG Strategy and Its Application

Journal of Healthcare Engineering ◽

10.1155/2021/6676194 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Haihe Shi ◽

Gang Wu

Keyword(s):

Gene Sequence ◽

Sequence Assembly ◽

Feature Model ◽

De Bruijn Graph ◽

Specific Sequence ◽

Component Library ◽

Assembly Algorithm ◽

Assembly Algorithms ◽

Abstract Algorithm ◽

Domain Level

With the continuous development of sequencing technology, the amount of bioinformatics data has increased geometrically, and this massive volume of data places more stringent requirements on sequence assembly. The sequence assembly algorithm based on the DBG (De Bruijn graph) strategy is a key algorithm in bioinformatics and is widely used in the domain of gene sequence assembly. Current research in the domain of sequence assembly focuses on optimizing specific steps of specific algorithms and lacks research on domain-level, highly abstract algorithm frameworks. To some extent, this leads to redundancy among sequence assembly algorithms, and manual selection of algorithms may cause further problems. This paper analyzes the domain of DBGSA and establishes a feature model of this domain. Based on the production programming method, the DBGSA algorithm components are interactively designed. With the support of the PAR platform, the DBGSA algorithm component library is formally implemented, and the library is then used to assemble specific algorithms. This research adds domain-level research to the domain of sequence assembly and implements the DBGSA component library, which can assemble specific sequence assembly algorithms, ensuring the efficiency of algorithm development and the reliability of the generated assembly algorithms. It also provides a valuable reference for solving problems in the domain of sequence assembly.


A New Implementation of Genome Rearrangement Problem

Journal of Healthcare Engineering ◽

10.1155/2021/6692775 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Xiaoqian Jing ◽

Haihe Shi

Keyword(s):

Simulated Annealing ◽

Branch And Bound ◽

Genome Rearrangement ◽

Greedy Algorithms ◽

Practical Application ◽

Component Library ◽

Greedy Strategy ◽

Assembly Algorithm ◽

Rearrangement Algorithm ◽

Biological Similarity

Unsigned reverse genome rearrangement is an important part of bioinformatics research and is widely used in biological similarity and homology analysis, revealing biological inheritance, variation, and evolution. Branch and bound, simulated annealing, and other algorithms for unsigned reverse genome rearrangement are rarely applied in practice because of their huge time and space consumption; greedy algorithms are mostly used at present. By deeply analyzing the domain of the unsigned reverse genome rearrangement algorithm (URGRA) based on the greedy strategy, the domain features are modeled, and the URGRA algorithm components are interactively designed according to the production programming method. With the support of the PAR platform, the algorithm component library of the URGRA is formally realized, and the concrete algorithm is generated by assembly, which improves the reliability of the assembly algorithm.


ComHapDet: a spatial community detection algorithm for haplotype assembly

BMC Genomics ◽

10.1186/s12864-020-06935-x ◽

2020 ◽

Vol 21 (S9) ◽

Cited By ~ 2

Author(s):

Abishek Sankararaman ◽

Haris Vikalo ◽

François Baccelli

Keyword(s):

Community Detection ◽

High Throughput Sequencing ◽

Graphical Representation ◽

Detection Algorithm ◽

Sequencing Data ◽

Assembly Algorithm ◽

Haplotype Assembly ◽

Therapeutic Drugs ◽

High Throughput Sequencing Data ◽

Community Detection Algorithm

Abstract Background Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual’s susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. Results We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes which allows both biallelic and multi-allelic variants. Conclusions Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
