Localized Genome Assembly from Reads to Scaffolds: Practical Traversal of the Paired String Graph

Using Apache Spark on genome assembly for scalable overlap-graph reduction

Human Genomics ◽

10.1186/s40246-019-0227-1 ◽

2019 ◽

Vol 13 (S1) ◽

Cited By ~ 1

Author(s):

Alexander J. Paul ◽

Dylan Lawrence ◽

Myoungkyu Song ◽

Seung-Hwan Lim ◽

Chongle Pan ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Time Frame ◽

Apache Spark ◽

Reference Sequence ◽

Graph Reduction ◽

De Novo Genome Assembly ◽

String Graph ◽

Edge Graph ◽

Generation Sequencing

Abstract

Background: De novo genome assembly builds the genome of a specimen from overlaps between genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds using these overlaps. The quality of a de novo assembly depends on its length and continuity. To enable faster and more accurate assembly, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the computation is difficult to parallelize.

Results: To address these limitations, we propose an algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. Its implementations are designed to run de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges along very long graph paths by adapting the scalable features of the graph processing libraries available for Apache Spark, GraphX and GraphFrames.

Conclusions: We share the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA scaled linearly as the number of computing instances increased.
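As a rough illustration of what "string graph reduction" can mean in the abstract above, the following Python sketch removes transitive edges from a small directed overlap graph. It is not SORA's implementation, which runs on Apache Spark; the function name `transitive_reduction` and the tuple-based edge format are assumptions made for this example, and the check only looks for two-step detours rather than performing a full transitive reduction.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop each edge u->w that is implied by a two-step path u->v->w.

    `edges` is an iterable of (source, target) pairs (hypothetical format).
    Returns the set of edges that survive the simplified check.
    """
    edges = list(edges)
    successors = defaultdict(set)
    for u, w in edges:
        successors[u].add(w)

    reduced = set()
    for u, w in edges:
        # The edge is redundant if some intermediate read v links u to w.
        redundant = any(
            w in successors.get(v, ())
            for v in successors[u]
            if v != u and v != w
        )
        if not redundant:
            reduced.add((u, w))
    return reduced


if __name__ == "__main__":
    # Read "a" overlaps "b" and "c"; "b" overlaps "c", so a->c is implied.
    print(sorted(transitive_reduction([("a", "b"), ("b", "c"), ("a", "c")])))
```

In a distributed setting the same idea would be expressed as joins over an edge table instead of an in-memory dictionary, which is the kind of operation graph libraries on Spark are built for.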

GAMS: Genome Assembly on Multi-GPU Using String Graph

2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS) ◽

10.1109/hpcc-smartcity-dss.2016.0057 ◽

2016 ◽

Author(s):

Gaurav Jain ◽

Lalchand Rathore ◽

Kolin Paul

Keyword(s):

Genome Assembly ◽

String Graph

GraphSeq: Accelerating String Graph Construction for De Novo Assembly on Spark

10.1101/321729 ◽

2018 ◽

Author(s):

Chung-Tsai Su ◽

Ming-Tai Chang ◽

Yun-Chian Cheng ◽

Yun-Lung Li ◽

Yao-Ting Wang

Keyword(s):

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Data Representation ◽

Important Application ◽

Supplementary Information ◽

De Novo Genome Assembly ◽

String Graph ◽

Computing Framework ◽

Variant Identification

Abstract

Summary: De novo genome assembly is an important application for both uncharacterized genome assembly and variant identification in a reference-unbiased way. In comparison with the de Bruijn graph, the string graph is a lossless data representation for de novo assembly. However, string graph construction is computationally intensive. We propose GraphSeq to accelerate string graph construction by leveraging a distributed computing framework.

Availability and Implementation: GraphSeq is implemented in Scala on Spark and is freely available at https://www.atgenomix.com/blog/graphseq.

Supplementary information: Supplementary data are available at Bioinformatics online.
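To make the cost of string graph construction in the abstract above concrete, here is a minimal Python sketch of the naive first step: finding suffix-prefix overlaps between every pair of reads. It is not GraphSeq's method; the function names, the `min_len` parameter, and the example reads are assumptions chosen for illustration. The all-pairs loop is quadratic in the number of reads, which is the kind of cost that motivates distributed approaches.

```python
def longest_overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` (assumed >= 1) exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Map (i, j) read-index pairs to their suffix-prefix overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue  # a read is not compared with itself
            k = longest_overlap(a, b, min_len)
            if k:
                edges[(i, j)] = k
    return edges


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "CGGTTT"]  # toy reads, not real data
    print(build_overlap_graph(reads, min_len=3))
```

Practical assemblers avoid the all-pairs comparison with indexes such as suffix arrays or k-mer hashing; the sketch only shows what an edge in the overlap graph represents.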

Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) ◽

10.1109/ipdps49936.2021.00060 ◽

2021 ◽

Author(s):

Giulia Guidi ◽

Oguz Selvitopi ◽

Marquita Ellis ◽

Leonid Oliker ◽

Katherine Yelick ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

De Novo Genome Assembly ◽

Transitive Reduction ◽

String Graph

ASA³P: An automatic and highly scalable pipeline for bacterial genome assembly, annotation and higher-level analyses

10.26226/morressier.5991c407d462b80292388dda ◽

2017 ◽

Author(s):

Oliver Schwengers

Keyword(s):

Genome Assembly ◽

Bacterial Genome

Faculty Opinions recommendation of Genome assembly comparison identifies structural variants in the human genome.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1057975.509926 ◽

2007 ◽

Author(s):

Ulf Pettersson

Keyword(s):

Human Genome ◽

Genome Assembly ◽

Structural Variants

Faculty Opinions recommendation of How can a high-quality genome assembly help plant breeders?

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.735958664.793561468 ◽

2019 ◽

Author(s):

Dirk Hincha

Keyword(s):

Genome Assembly ◽

High Quality ◽

High Quality Genome

A Chromosome-Level Genome Assembly Identifies the Origin of Neo-Y Chromosome to the Unique X1X2Y System

SSRN Electronic Journal ◽

10.2139/ssrn.3509876 ◽

2019 ◽

Author(s):

Yongshuang Xiao ◽

Zhizhong Xiao ◽

Daoyuan Ma ◽

Chenxi Zhao ◽

Lin Liu ◽

...

Keyword(s):

Y Chromosome ◽

Genome Assembly ◽

Chromosome Level

A high‐quality carabid genome assembly provides insights into beetle genome evolution and cold adaptation

Molecular Ecology Resources ◽

10.1111/1755-0998.13409 ◽

2021 ◽

Author(s):

Yi‐Ming Weng ◽

Charlotte B. Francoeur ◽

Cameron R. Currie ◽

David H. Kavanaugh ◽

Sean D. Schoville

Keyword(s):

Genome Evolution ◽

Genome Assembly ◽

Cold Adaptation ◽

High Quality

Chromosome-level genome assembly of a regenerable maize inbred line A188

Genome Biology ◽

10.1186/s13059-021-02396-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Guifang Lin ◽

Cheng He ◽

Jun Zheng ◽

Dal-Hoe Koo ◽

Ha Le ◽

...

Keyword(s):

Inbred Line ◽

Genome Assembly ◽

Gene Function ◽

Maize Inbred Line ◽

Carotenoid Cleavage Dioxygenase ◽

Structural Variations ◽

Embryonic Callus ◽

Network Analyses ◽

A Genome ◽

Chromosome Level

Abstract

Background: The maize inbred line A188 is an attractive model for elucidation of gene function and improvement due to its high embryogenic capacity and many contrasting traits to the first maize reference genome, B73, and other elite lines. The lack of a genome assembly of A188 limits its use as a model for functional studies.

Results: Here, we present a chromosome-level genome assembly of A188 using long reads and optical maps. Comparison of A188 with B73 using both whole-genome alignments and read depths from sequencing reads identifies approximately 1.1 Gb of syntenic sequences as well as extensive structural variation, including a 1.8-Mb duplication containing the Gametophyte factor1 locus for unilateral cross-incompatibility, and six inversions of 0.7 Mb or greater. Increased copy number of carotenoid cleavage dioxygenase 1 (ccd1) in A188 is associated with elevated expression during seed development. High ccd1 expression in seeds together with low expression of yellow endosperm 1 (y1) reduces carotenoid accumulation, accounting for the white seed phenotype of A188. Furthermore, transcriptome and epigenome analyses reveal enhanced expression of defense pathways and altered DNA methylation patterns of the embryonic callus.

Conclusions: The A188 genome assembly provides a high-resolution sequence for a complex genome species and a foundational resource for analyses of genome variation and gene function in maize. The genome, in comparison to B73, contains extensive intra-species structural variations and other genetic differences. Expression and network analyses identify discrete profiles for embryonic callus and other tissues.
