SLOW5: a new file format enables massive acceleration of nanopore sequencing data analysis

10.1101/2021.06.29.450255 ◽

2021 ◽

Author(s):

Hasindu Gamaarachchi ◽

Hiruna Samarakoon ◽

Sasha P. Jenner ◽

James M. Ferguson ◽

Timothy G. Amos ◽

...

Keyword(s):

Data Analysis ◽

High Performance ◽

Nanopore Sequencing ◽

File Format ◽

Sequencing Data ◽

Genome Wide ◽

Wide Range ◽

File Structure ◽

Order Of Magnitude ◽

Methylation Profiling

Nanopore sequencing is an emerging genomic technology with great potential. However, the storage and analysis of nanopore sequencing data have become major bottlenecks preventing more widespread adoption in research and clinical genomics. Here, we elucidate an inherent limitation in the file format used to store raw nanopore data, known as FAST5, that prevents efficient analysis on high-performance computing (HPC) systems. To overcome this we have developed SLOW5, an alternative file format that permits efficient parallelisation and, thereby, acceleration of nanopore data analysis. For example, we show that using SLOW5 format, instead of FAST5, reduces the time and cost of genome-wide DNA methylation profiling by an order of magnitude on common HPC systems, and delivers consistent improvements on a wide range of different architectures. With a simple, accessible file structure and a ~25% reduction in size compared to FAST5, SLOW5 format will deliver substantial benefits to all areas of the nanopore community.


Fast nanopore sequencing data analysis with SLOW5

Nature Biotechnology ◽

10.1038/s41587-021-01147-4 ◽

2022 ◽

Author(s):

Hasindu Gamaarachchi ◽

Hiruna Samarakoon ◽

Sasha P. Jenner ◽

James M. Ferguson ◽

Timothy G. Amos ◽

...

Keyword(s):

Data Analysis ◽

High Performance ◽

Computer Architectures ◽

Nanopore Sequencing ◽

File Format ◽

Sequencing Data ◽

High Performance Computer ◽

Sequencing Data Analysis ◽

Efficient Parallelization ◽

Methylation Profiling

Abstract Nanopore sequencing depends on the FAST5 file format, which does not allow efficient parallel analysis. Here we introduce SLOW5, an alternative format engineered for efficient parallelization and acceleration of nanopore data analysis. Using the example of DNA methylation profiling of a human genome, analysis runtime is reduced from more than two weeks to approximately 10.5 h on a typical high-performance computer. SLOW5 is approximately 25% smaller than FAST5 and delivers consistent improvements on different computer architectures.


Genome-wide Detection of Cytosine Methylations in Plant from Nanopore sequencing data using Deep Learning

10.1101/2021.02.07.430077 ◽

2021 ◽

Author(s):

Peng Ni ◽

Neng Huang ◽

Fan Nie ◽

Jun Zhang ◽

Zhi Zhang ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

High Performance ◽

Bisulfite Sequencing ◽

Dna Bases ◽

Nanopore Sequencing ◽

Computational Pipeline ◽

Sequencing Data ◽

Plant Genomes ◽

Genome Wide ◽

Low Coverage

Abstract Methylation states of DNA bases can be detected directly from native Nanopore reads. At present, many computational methods can accurately detect 5mCs in CpG contexts from Nanopore sequencing. However, there is currently a lack of methods to detect 5mCs in non-CpG contexts. In this study, we propose a computational pipeline that can detect 5mC sites in both CpG and non-CpG contexts of plant genomes using Nanopore sequencing. We also sequenced two model plants, Arabidopsis thaliana (A. thaliana) and Oryza sativa (O. sativa), using Nanopore sequencing and bisulfite sequencing. The results of our proposed pipeline in the two plants achieved high correlations with bisulfite sequencing: above 0.98, 0.96, and 0.85 for the CpG, CHG, and CHH (H indicates A, C or T) motifs, respectively. Our proposed pipeline also achieved high performance on Brassica nigra (B. nigra). Experiments also showed that our proposed pipeline can achieve high performance even with low coverage of reads. Moreover, by using Nanopore sequencing, our proposed pipeline is capable of profiling methylation of more cytosines than bisulfite sequencing.


Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa068 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Yang Liao ◽

Wei Shi

Keyword(s):

Data Analysis ◽

Pearson Correlation ◽

Rna Seq ◽

Genome Wide ◽

Gene Level ◽

Sequencing Quality ◽

Total Data ◽

Order Of Magnitude ◽

Gene Expression Quantification ◽

The Impact

Abstract RNA sequencing (RNA-seq) is currently the standard method for genome-wide expression profiling. RNA-seq reads often need to be mapped to a reference genome before read counts can be produced for genes. Read trimming methods have been developed to assist read mapping by removing adapter sequences and low-sequencing-quality bases. It is, however, unclear what impact read trimming has on the quantification of RNA-seq data, an important task in RNA-seq data analysis. In this study, we used a benchmark RNA-seq dataset and simulation data to assess the impact of read trimming on mapping and quantification of RNA-seq reads. We found that adapter sequences can be effectively removed by the read aligner via 'soft-clipping' and that many low-sequencing-quality bases, which would be removed by read trimming tools, were rescued by the aligner. Accuracy of gene expression quantification using untrimmed reads was found to be comparable to or slightly better than that using trimmed reads, based on Pearson correlation with reverse transcriptase-polymerase chain reaction data and simulation truth. Total data analysis time was reduced by up to an order of magnitude when read trimming was not performed. Our study suggests that read trimming is a redundant process in the quantification of RNA-seq expression data.


Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies

Computational Methods for Next Generation Sequencing Data Analysis ◽

10.1002/9781119272182.ch9 ◽

2016 ◽

pp. 197-226

Author(s):

Jeong-Hyeon Choi ◽

Huidong Shi

Keyword(s):

Dna Methylation ◽

Data Analysis ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Computational Approaches ◽

Genome Wide ◽

Generation Sequencing ◽

Sequencing Data Analysis


PyGNA: a unified framework for geneset network analysis

BMC Bioinformatics ◽

10.1186/s12859-020-03801-1 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Viola Fanfani ◽

Fabio Cassano ◽

Giovanni Stracquadanio

Keyword(s):

Data Analysis ◽

Network Analysis ◽

Biological Networks ◽

High Performance ◽

Large Scale ◽

Sequencing Data ◽

Unified Framework ◽

Null Distributions ◽

Viable Approach ◽

Easy Integration

Abstract Background: Gene and protein interaction experiments provide unique opportunities to study the molecular wiring of a cell. Integrating high-throughput functional genomics data with this information can help identify networks associated with complex diseases and phenotypes. Results: Here we introduce an integrated statistical framework to test network properties of single and multiple genesets under different interaction models. We implemented this framework as open-source software, called Python Geneset Network Analysis (PyGNA). Our software is designed for easy integration into existing analysis pipelines and to generate high quality figures and reports. We also developed PyGNA to take advantage of multi-core systems to generate calibrated null distributions on large datasets. We then present the results of extensive benchmarking of the tests implemented in PyGNA and a use case inspired by RNA sequencing data analysis, showing how PyGNA can be easily integrated to study biological networks. PyGNA is available at http://github.com/stracquadaniolab/pygna and can be easily installed using the PyPI or Anaconda package managers, and Docker. Conclusions: We present a tool for network-aware geneset analysis. PyGNA can either be readily integrated into existing high-performance data analysis pipelines or used as a Python package to implement new tests and analyses. With the increasing availability of population-scale omic data, PyGNA provides a viable approach for large scale geneset network analysis.


Random forest based similarity learning for single cell RNA sequencing data

10.1101/258699 ◽

2018 ◽

Author(s):

Maziyar Baran Pouyan ◽

Dennis Kostka

Keyword(s):

Data Analysis ◽

Random Forest ◽

Single Cells ◽

R Package ◽

Similarity Learning ◽

Sequencing Data ◽

Genome Wide ◽

Step Procedure ◽

Exploratory Data ◽

Cell Cell

Abstract Motivation: Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore obtaining accurate cell–cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that application of generic methods, or of methods developed for bulk RNA-seq data, is likely suboptimal. Results: Here we present RAFSIL, a random forest based approach to learn cell–cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, where feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks like dimension reduction, visualization, and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach yielding a useful tool that improves the analysis of scRNA-seq data. Availability and Implementation: The RAFSIL R package is available online at www.kostkalab.net/software.html


sangeranalyseR: simple and interactive analysis of Sanger sequencing data in R

10.1101/2020.05.18.102459 ◽

2020 ◽

Author(s):

Kuan-Hao Chao ◽

Kirston Barton ◽

Sarah Palmer ◽

Robert Lanfear

Keyword(s):

Sanger Sequencing ◽

Reference Sequence ◽

Supplementary Information ◽

File Format ◽

Bioconductor Package ◽

Sequencing Data ◽

Interactive Analysis ◽

Link Type ◽

Online Documentation ◽

Wide Range

Abstract Summary: sangeranalyseR is an interactive R/Bioconductor package and two associated Shiny applications designed for analysing Sanger sequencing data from the ABIF file format in R. It allows users to go from loading reads to saving aligned contigs in a few lines of R code. sangeranalyseR provides a wide range of options for a number of commonly-performed actions including read trimming, detecting secondary peaks, viewing chromatograms, and detecting indels using a reference sequence. All parameters can be adjusted interactively either in R or in the associated Shiny applications. sangeranalyseR comes with extensive online documentation, and outputs detailed interactive HTML reports. Availability and implementation: sangeranalyseR is implemented in R and released under an MIT license. It is available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR) and on Github (https://github.com/roblanf/sangeranalyseR). Supplementary information: Documentation at https://sangeranalyser.readthedocs.io/.


Identification of polymorphic and off-target probe binding sites on the Illumina Infinium MethylationEPIC BeadChip

10.1101/056937 ◽

2016 ◽

Author(s):

Daniel L. McCartney ◽

Rosie M. Walker ◽

Stewart W. Morris ◽

Andrew M. McIntosh ◽

David J. Porteous ◽

...

Keyword(s):

Dna Methylation ◽

Genome Wide ◽

Wide Range ◽

Control Procedures ◽

Target Probe ◽

Genomic Regions ◽

Array Performance ◽

The Relationship ◽

Inexpensive Technique ◽

Methylation Profiling

Abstract Genome-wide analysis of DNA methylation has now become a relatively inexpensive technique thanks to array-based methylation profiling technologies. The recently developed Illumina Infinium MethylationEPIC BeadChip interrogates methylation at over 850,000 sites across the human genome, covering 99% of RefSeq genes. This array supersedes the widely used Infinium HumanMethylation450 BeadChip, which has permitted insights into the relationship between DNA methylation and a wide range of conditions and traits. Previous research has identified issues with certain probes on both the HumanMethylation450 BeadChip and its predecessor, the Infinium HumanMethylation27 BeadChip, which were predicted to affect array performance. These issues concerned probe-binding specificity and the presence of polymorphisms at target sites. Using in silico methods, we have identified probes on the Infinium MethylationEPIC BeadChip that are predicted to (i) measure methylation at polymorphic sites and (ii) hybridise to multiple genomic regions. We intend these resources to be used for quality control procedures when analysing data derived from this platform.


Enhanced terahertz detection of multigate graphene nanostructures

Nanophotonics ◽

10.1515/nanoph-2021-0573 ◽

2022 ◽

Vol 0 (0) ◽

Author(s):

Juan A. Delgado-Notario ◽

Wojciech Knap ◽

Vito Clericò ◽

Juan Salvador-Sánchez ◽

Jaime Calvo-Gallego ◽

...

Keyword(s):

High Performance ◽

Room Temperature ◽

Thz Radiation ◽

Electron Hole ◽

Back Gate ◽

Potential Barriers ◽

Terahertz Detection ◽

Wide Range ◽

Order Of Magnitude ◽

Effect Transistor

Abstract Terahertz (THz) waves have shown great potential for use in various fields and for a wide range of challenging applications. High-performance detectors are, however, vital for exploitation of THz technology. Graphene plasmonic THz detectors have proven to be promising optoelectronic devices, but improving their performance is still necessary. In this work, an asymmetric-dual-grating-gate graphene-terahertz-field-effect-transistor with a graphite back-gate was fabricated and characterized under illumination of 0.3 THz radiation in the temperature range from 4.5 K up to room temperature. The device was fabricated as a sub-THz detector using a heterostructure of h-BN/Graphene/h-BN/Graphite to make a transistor with a double asymmetric-grating-top-gate and a continuous graphite back-gate. By biasing the metallic top-gates and the graphite back-gate, abrupt n+n (or p+p) or np (or pn) junctions with different potential barriers are formed along the graphene layer, leading to enhancement of the THz rectified signal by about an order of magnitude. The plasmonic rectification for graphene containing np junctions is interpreted as due to the plasmonic electron-hole ratchet mechanism, whereas, for graphene with n+n junctions, rectification is attributed to the differential plasmonic drag effect. This work shows a new way of responsivity enhancement and paves the way towards new record performances of graphene THz nano-photodetectors.
