SpaRC: Scalable Sequence Clustering using Apache Spark

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large scale sequence data analysis problems. The software is available under the Apache 2.0 license at https://bitbucket.org/LizhenShi/sparc.
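
The abstract does not spell out how reads are partitioned, so the following is only a minimal single-machine sketch of one common idea: reads that share a k-mer are linked, and connected groups of linked reads form clusters. The function names, the value of `k`, and the toy reads are all assumptions for illustration; this is not SpaRC's implementation, it does not use Spark, and it ignores reverse complements and sequencing errors.

```python
from collections import defaultdict


def kmers(read, k):
    """Return the set of all substrings of length k in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def cluster_reads(reads, k=5):
    """Group read indices into clusters of reads connected by shared k-mers."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                union(first_seen[km], idx)
            else:
                first_seen[km] = idx

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Toy reads (invented): reads 0 and 1 overlap, as do reads 2 and 3.
toy_reads = ["ACGTACGTGG", "CGTGGATTCA", "TTTTTCCCCC", "TCCCCCAAAA"]
clusters = cluster_reads(toy_reads, k=5)
print(clusters)
```

On real data the k-mer index would be far too large for one process, which is the kind of workload a distributed framework such as Apache Spark is meant to spread across nodes.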

Validation of variants using cost effective high resolution melting (HRM) analysis predicted from target re-sequencing in Eucalyptus

Acta Botanica Croatica ◽

10.37427/botcro-2020-019 ◽

2020 ◽

Vol 79 (2) ◽

pp. 105-113

Author(s):

Abdul Bari Muneera Parveen ◽

Divya Lakshmanan ◽

Modhumita Ghosh Dasgupta

Keyword(s):

Next Generation Sequencing ◽

Large Scale ◽

Sequence Data ◽

Cost Effective ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Time Saving ◽

Hrm Analysis ◽

The Cost ◽

Generation Sequencing

The advent of next-generation sequencing has facilitated large-scale discovery and mapping of genomic variants for high-throughput genotyping. Several research groups working on tree species are presently employing next generation sequencing (NGS) platforms for marker discovery, since it is a cost-effective and time-saving strategy. However, most trees lack a chromosome level genome map and validation of variants for downstream application becomes obligatory. The cost associated with identifying potential variants from the enormous amount of sequence data is a major limitation. In the present study, high resolution melting (HRM) analysis was optimized for rapid validation of single nucleotide polymorphisms (SNPs), insertions or deletions (InDels) and simple sequence repeats (SSRs) predicted from exome sequencing of parents and hybrids of Eucalyptus tereticornis Sm. × Eucalyptus grandis Hill ex Maiden generated from controlled hybridization. The cost per data point was less than 0.5 USD, providing great flexibility in terms of cost and sensitivity, when compared to other validation methods. The sensitivity of this technology in variant detection can be extended to other applications including Bar-HRM for species authentication and TILLING for detection of mutants.

Next-Generation Sequencing Technologies in Blood Group Typing

Transfusion Medicine and Hemotherapy ◽

10.1159/000504765 ◽

2019 ◽

Vol 47 (1) ◽

pp. 4-13 ◽

Cited By ~ 1

Author(s):

Daniel Fürst ◽

Chrysanthi Tsamadou ◽

Christine Neuchel ◽

Hubert Schrezenmeier ◽

Joannis Mytilineos ◽

...

Keyword(s):

Next Generation Sequencing ◽

Blood Group ◽

Large Scale ◽

Cost Effective ◽

Molecular Testing ◽

Blood Group Antigens ◽

Next Generation ◽

Sequencing Technologies ◽

Blood Group Typing ◽

Generation Sequencing

Sequencing of the human genome has led to the definition of the genes for most of the relevant blood group systems, and the polymorphisms responsible for most of the clinically relevant blood group antigens are characterized. Molecular blood group typing is used in situations where erythrocytes are not available or where serological testing was inconclusive or not possible due to the lack of antisera. Also, molecular testing may be more cost-effective in certain situations. Molecular typing approaches are mostly based on either PCR with specific primers, DNA hybridization, or DNA sequencing. Particularly the transition of sequencing techniques from Sanger-based sequencing to next-generation sequencing (NGS) technologies has led to exciting new possibilities in blood group genotyping. We describe briefly the currently available NGS platforms and their specifications, depict the genetic background of blood group polymorphisms, and discuss applications for NGS approaches in immunohematology. As an example, we delineate a protocol for large-scale donor blood group screening established and in use at our institution. Furthermore, we discuss technical challenges and limitations as well as the prospect for future developments, including long-read sequencing technologies.

Quick and efficient approach to develop genomic resources in orphan species: Application in Lavandula angustifolia

PLoS ONE ◽

10.1371/journal.pone.0243853 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243853

Author(s):

Berline Fopa Fomeju ◽

Dominique Brunel ◽

Aurélie Bérard ◽

Jean-Baptiste Rivoal ◽

Philippe Gallois ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Rapid Development ◽

Genetic Distances ◽

Lavandula Angustifolia ◽

Distance Analysis ◽

Alternative Medicines ◽

Dna And Rna ◽

Snp Development ◽

High Level

Next-Generation Sequencing (NGS) technologies, by reducing the cost and increasing the throughput of sequencing, have opened doors to generate genomic data in a range of previously poorly studied species. In this study, we propose a method for the rapid development of large-scale molecular resources for orphan species. We studied as an example the true lavender (Lavandula angustifolia Mill.), a perennial sub-shrub plant native to the Mediterranean region and whose essential oil has numerous applications in cosmetics, pharmaceuticals, and alternative medicines. The heterozygous clone “Maillette” was used as a reference for DNA and RNA sequencing. We first built a reference Unigene, composed of coding sequences, thanks to de novo RNA-seq assembly. Then, we reconstructed the complete gene sequences (with introns and exons) using a Unigene-guided DNA-seq assembly approach. This aimed to maximize the possibilities of finding polymorphism between genetically close individuals despite the lack of a reference genome. Finally, we used these resources for SNP mining within a collection of 16 commercial lavender clones and tested the SNPs within the scope of a genetic distance analysis. We obtained a cleaned reference of 8,030 functionally in silico annotated genes. We found 359K polymorphic sites and observed a high SNP frequency (mean of 1 SNP per 90 bp) and a high level of heterozygosity (more than 60% of heterozygous SNPs per genotype). Overall, we found similar genetic distances between pairs of clones, which is probably related to the out-crossing nature of the species and the restricted area of cultivation. The proposed method is transferable to other orphan species, requires little bioinformatics resources and can be realized within a year. This is also the first reported large-scale SNP development on Lavandula angustifolia. All the genomics resources developed herein are publicly available and provide a rich pool of molecular resources to explore and exploit lavender genetic diversity in breeding programs.

LeafGo: Leaf to Genome, a quick workflow to produce high-quality De novo genomes with Third Generation Sequencing technology

10.1101/2021.01.25.428044 ◽

2021 ◽

Author(s):

Patrick Driguez ◽

Salim Bougouffa ◽

Karen Carty ◽

Alexander Putra ◽

Kamel Jabbari ◽

...

Keyword(s):

De Novo ◽

Rapid Development ◽

Plant Genome ◽

Plant Genomics ◽

High Quality ◽

High Molecular Weight Dna ◽

Tissue Samples ◽

Sequencing Technologies ◽

The Cost ◽

New Generation

Recent years have witnessed a rapid development of sequencing technologies. Fundamental differences and limitations among various platforms impact the time, the cost and the accuracy for sequencing whole genomes. Here we designed a complete de novo plant genome generation workflow that starts from plant tissue samples and produces high-quality draft genomes with relatively modest laboratory and bioinformatic resources within seven days. To optimize our workflow we selected different species of plants which were used to extract high molecular weight DNA, to make PacBio and ONT libraries for sequencing with the Sequel I, Sequel II and GridION platforms. We assembled high-quality draft genomes of two different Eucalyptus species, E. rudis and E. camaldulensis, to chromosome level without using additional scaffolding technologies. For the rapid production of de novo genome assembly of plant species we showed that our DNA extraction protocol followed by PacBio high fidelity sequencing, and assembly with new generation assemblers such as hifiasm, produce excellent results. Our findings will be a valuable benchmark for groups planning wet- and dry-lab plant genomics research and for high throughput plant genomics initiatives.

EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500360 ◽

2021 ◽

Author(s):

Ming Cao ◽

Qinke Peng ◽

Ze-Gang Wei ◽

Fei Liu ◽

Yi-Fan Hou

Keyword(s):

Large Scale ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Sequencing Data ◽

Clustering Method ◽

Cluster Number ◽

Sequence Clustering ◽

Downstream Analysis ◽

Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
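
To make the idea of heuristic, seed-based clustering concrete, here is a minimal sketch of a generic greedy scheme: sequences are visited longest first, and each one joins the first cluster whose seed is within a fixed number of edits, or else founds a new cluster. This is an illustration under stated assumptions rather than edClust itself: the pure-Python `edit_distance` computes a global Levenshtein distance instead of the semi-global alignment Edlib provides, and the names `greedy_cluster` and `max_edits` and the toy sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_edits=1):
    """Assign each sequence to the first seed within max_edits, else make it a seed."""
    clusters = []  # list of (seed, members) pairs, in order of creation
    for seq in sorted(seqs, key=len, reverse=True):  # longest first, stable
        for seed, members in clusters:
            if edit_distance(seq, seed) <= max_edits:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Toy sequences (invented): two groups that differ by a single edit each.
toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGC"]
result = greedy_cluster(toy, max_edits=1)
for seed, members in result:
    print(seed, members)
```

Because each sequence is compared only against seeds, the outcome depends on visiting order and on the threshold, which is one reason heuristic methods can differ in how many clusters they report.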

Direct-to-Consumer Genetic Testing

Genomics and Bioethics ◽

10.4018/978-1-61692-883-4.ch005 ◽

2011 ◽

pp. 51-84 ◽

Cited By ~ 1

Author(s):

Richard A. Stein

Keyword(s):

Human Genome ◽

Human Genome Project ◽

Cost Effective ◽

Helical Structure ◽

Genome Project ◽

Next Generation ◽

Double Helical Structure ◽

Sequencing Technologies ◽

Human Genome Sequencing ◽

The Human Genome Project

The 1953 discovery of the DNA double-helical structure by James Watson, Francis Crick, Maurice Wilkins, and Rosalind Franklin represented one of the most significant advances in the biomedical world (Watson and Crick 1953; Maddox 2003). Almost half a century after this landmark event, in February 2001, the initial draft sequences of the human genome were published (Lander et al., 2001; Venter et al., 2001) and, in April 2003, the International Human Genome Sequencing Consortium reported the completion of the Human Genome Project, a massive international collaborative endeavor that started in 1990 and is thought to represent the most ambitious undertaking in the history of biology (Collins et al., 2003; Thangadurai, 2004; National Human Genome Research Institute). The Human Genome Project provided a plethora of genetic and genomic information that significantly changed our perspectives on biomedical and social sciences. The sequencing of the first human genome was a 13-year, 2.7-billion-dollar effort that relied on the automated Sanger (dideoxy or chain termination) method, which was developed in 1977, around the same time as the Maxam-Gilbert (chemical) sequencing, and subsequently became the most frequently used approach for several decades (Sanger et al., 1975; Maxam & Gilbert, 1977; Sanger et al., 1977). The new generations of DNA sequencing technologies, known as next-generation (second generation) and next-next-generation (third generation) sequencing, which started to be commercialized in 2005, enabled the cost-effective sequencing of large chromosomal regions during progressively shorter time frames, and opened the possibility for new applications, such as the sequencing of single-cell genomes (Service, 2006; Blow, 2008; Morozova and Marra, 2008; Metzker, 2010).

Sampling the Waterhemp (Amaranthus tuberculatus) Genome Using Pyrosequencing Technology

Weed Science ◽

10.1614/ws-09-021.1 ◽

2009 ◽

Vol 57 (5) ◽

pp. 463-469 ◽

Cited By ~ 37

Author(s):

Ryan M. Lee ◽

Jyothi Thimmapuram ◽

Kate A. Thinglum ◽

George Gong ◽

Alvaro G. Hernandez ◽

...

Keyword(s):

Next Generation Sequencing ◽

Large Scale ◽

Ecological Model ◽

Average Length ◽

Science Research ◽

Complete Sequence ◽

Next Generation ◽

Next Generation Sequencing Technology ◽

Sequencing Technologies ◽

Generation Sequencing

Recent advances in sequencing technologies (next-generation sequencing) offer dramatically increased sequencing throughput at a lower cost than traditional Sanger sequencing. This technology is changing genomics research by allowing large scale sequencing experiments in nonmodel systems. Waterhemp is an important weed in the midwestern United States with characteristics that make it an interesting ecological model. However, very few genomic resources are available for this species. One half of a 70 by 75 picotiter plate of 454-pyrosequencing was performed on total DNA isolated from waterhemp, generating 158,015 reads of an average length of 271 bp, or a total of nearly 43 Mbp of sequence. Included in this sequence was a nearly complete sequence of the chloroplast genome, sequences of several important herbicide resistance genes, leads for simple sequence repeat (SSR) markers, and a sampling of the repeated elements (e.g., transposons) present in this species. Here we present the waterhemp genomic data gleaned from this sequencing experiment and illustrate the value of next-generation sequencing technology to weed science research.
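
The reported total can be checked from the two figures given in the abstract. The short calculation below multiplies the read count by the average read length; the variable names are illustrative, and because the average length is a rounded value the product is only an approximation of the true total.

```python
# Figures taken from the abstract above.
read_count = 158_015      # number of reads
mean_read_length = 271    # average read length in bp (rounded)

total_bp = read_count * mean_read_length
total_mbp = total_bp / 1_000_000

print(f"{total_bp:,} bp = {total_mbp:.1f} Mbp")
# prints "42,822,065 bp = 42.8 Mbp"
```

The result, about 42.8 Mbp, agrees with the abstract's "nearly 43 Mbp".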

A parallel computational framework for ultra-large-scale sequence clustering analysis

Bioinformatics ◽

10.1093/bioinformatics/bty617 ◽

2018 ◽

Vol 35 (3) ◽

pp. 380-388 ◽

Cited By ~ 2

Author(s):

Wei Zheng ◽

Qi Mao ◽

Robert J Genco ◽

Jean Wactawski-Wende ◽

Michael Buck ◽

...

Keyword(s):

Parallel Computing ◽

High Performance ◽

Large Scale ◽

De Novo ◽

Rapid Development ◽

Operational Taxonomic Unit ◽

Supplementary Information ◽

Computational Framework ◽

Speed Up ◽

Scale Sequence

Motivation: The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desired that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing. Results: In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods and meanwhile maintains the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method. Availability and implementation: Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SLAD.html. Supplementary information: Supplementary data are available at Bioinformatics online.

A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies

PLoS ONE ◽

10.1371/journal.pone.0017915 ◽

2011 ◽

Vol 6 (3) ◽

pp. e17915 ◽

Cited By ~ 144

Author(s):

Wenyu Zhang ◽

Jiajia Chen ◽

Yang Yang ◽

Yifei Tang ◽

Jing Shang ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genome Assembly ◽

De Novo ◽

Software Tools ◽

Next Generation ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Generation Sequencing ◽

Assembly Software

RNA-seq: primary cells, cell lines and heat stress

10.1101/013979 ◽

2015 ◽

Author(s):

Carl J Schmidt ◽

Elizabeth M Pritchett ◽

Liang Sun ◽

Richard V.N. Davis ◽

Allen Hubbard ◽

...

Keyword(s):

Gene Expression ◽

Heat Stress ◽

Sequence Data ◽

Expression Patterns ◽

Effective Means ◽

Cost Effective ◽

Rna Seq ◽

Individual Gene ◽

Examine Gene Expression ◽

A Cell

Transcriptome analysis by RNA-seq has emerged as a high-throughput, cost-effective means to evaluate the expression pattern of genes in organisms. Unlike other methods, such as microarrays or quantitative PCR, RNA-seq is a target-free method that permits analysis of essentially any RNA that can be amplified from a cell or tissue. At its most basic, RNA-seq can determine individual gene expression levels by counting the number of times a particular transcript was found in the sequence data. Transcript levels can be compared across multiple samples to identify differentially expressed genes and infer differences in biological states between the samples. We have used this approach to examine gene expression patterns in chicken and human cells, with particular interest in determining response to heat stress.
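
The counting idea described above can be sketched in a few lines. The example assumes each read has already been assigned to a transcript by some upstream alignment step, counts reads per transcript, and scales the counts to counts per million so that samples of different depth can be compared. The function names, the sample labels, and the read assignments are invented toy data for illustration, not results from the study, and real analyses use statistical models rather than a raw ratio.

```python
from collections import Counter


def count_transcripts(assignments):
    """Count how many reads were assigned to each transcript."""
    return Counter(assignments)


def counts_per_million(counts):
    """Scale raw counts by library size so samples are comparable."""
    total = sum(counts.values())
    return {name: n * 1_000_000 / total for name, n in counts.items()}


# Invented toy data: one transcript label per read, for two samples.
control_reads = ["HSP70"] * 2 + ["ACTB"] * 8
heat_reads = ["HSP70"] * 6 + ["ACTB"] * 4

control_cpm = counts_per_million(count_transcripts(control_reads))
heat_cpm = counts_per_million(count_transcripts(heat_reads))

# Ratio of normalized expression between the two samples for one transcript.
fold_change = heat_cpm["HSP70"] / control_cpm["HSP70"]
print(f"HSP70 fold change (heat vs control): {fold_change:.1f}")
```

Normalizing by the total read count matters because a transcript can appear to have more reads simply because one sample was sequenced more deeply.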
