Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy
2020 ◽  
Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Srivatsan Sridhar ◽  
Tsachy Weissman

Abstract Motivation Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has the potential to significantly reduce space requirements without adversely impacting the performance of downstream applications. Results We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in the compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required to reach a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. Availability and implementation The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation. Supplementary information Supplementary data are available at Bioinformatics online.
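To make the size/accuracy tradeoff concrete, below is a minimal sketch of error-bounded lossy compression of a raw nanopore signal, assuming the signal is a 1-D int16 array as stored in fast5 files. Uniform scalar quantization followed by zlib stands in for the state-of-the-art lossy time-series compressors evaluated in the paper; it illustrates the principle, not the authors' method.

```python
# Error-bounded lossy compression sketch: quantize the raw signal with a
# guaranteed maximum absolute error, then entropy-code the bin indices.
import zlib
import numpy as np

def lossy_compress(signal: np.ndarray, max_error: int = 4) -> bytes:
    step = 2 * max_error + 1                      # quantization bin width
    q = np.round(signal.astype(np.int32) / step)  # bin indices
    return zlib.compress(q.astype(np.int16).tobytes(), 9)

def lossy_decompress(blob: bytes, max_error: int = 4) -> np.ndarray:
    step = 2 * max_error + 1
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int16).astype(np.int32)
    return (q * step).astype(np.int16)            # reconstruction, |error| <= max_error

signal = np.random.randint(400, 900, size=100_000).astype(np.int16)  # stand-in raw signal
blob = lossy_compress(signal)
recon = lossy_decompress(blob)
assert np.abs(recon.astype(np.int32) - signal.astype(np.int32)).max() <= 4
print(f"lossy: {len(blob)} bytes vs lossless: {len(zlib.compress(signal.tobytes(), 9))} bytes")
```

Larger `max_error` values widen the quantization bins, trading reconstruction fidelity for smaller compressed size, which is exactly the tradeoff the paper measures against basecalling and consensus accuracy.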


Author(s):  
Kexin Huang ◽  
Tianfan Fu ◽  
Lucas M Glass ◽  
Marinka Zitnik ◽  
Cao Xiao ◽  
...  

Abstract Summary Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation https://github.com/kexinhuang12345/DeepPurpose. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
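As a rough illustration of the intended workflow, the sketch below follows the usage pattern shown in the DeepPurpose README; the dataset loader, encoder names and call signatures are taken from that documentation and may differ across versions, so treat them as assumptions and check the repository.

```python
# DeepPurpose workflow sketch: load a DTI benchmark, pick encoders, train.
from DeepPurpose import utils, dataset
from DeepPurpose import DTI as models

# Load a benchmark DTI dataset (drug SMILES, protein sequences, affinities).
X_drug, X_target, y = dataset.load_process_DAVIS('./data', binary=False)

# Choose one compound encoder and one protein encoder.
drug_enc, target_enc = 'CNN', 'Transformer'
train, val, test = utils.data_process(X_drug, X_target, y,
                                      drug_enc, target_enc,
                                      split_method='random', frac=[0.7, 0.1, 0.2])

config = utils.generate_config(drug_encoding=drug_enc,
                               target_encoding=target_enc,
                               train_epoch=5, LR=1e-4, batch_size=128)
model = models.model_initialize(**config)
model.train(train, val, test)  # trains and reports test-set performance
```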


2018 ◽  
Vol 35 (14) ◽  
pp. 2498-2500 ◽  
Author(s):  
Ehsaneddin Asgari ◽  
Philipp C Münch ◽  
Till R Lesker ◽  
Alice C McHardy ◽  
Mohammad R K Mofrad

Abstract Summary Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU) clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer-based state-of-the-art approach in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers, in both resolution and coverage, evaluated over known links from the literature and synthetic benchmark datasets. Availability and implementation DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa. Supplementary information Supplementary data are available at Bioinformatics online.
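The segmentation idea can be illustrated with a toy byte-pair-encoding pass over reads: repeatedly merge the most frequent adjacent symbol pair, so that reads decompose into frequent variable-length subsequences whose counts replace OTU features. This is a conceptual sketch, not DiTaxa's actual implementation.

```python
# Greedy BPE over nucleotide sequences: learn frequent variable-length
# segments, then represent each read by its segment counts.
from collections import Counter

def learn_merges(reads, n_merges=10):
    seqs = [list(r) for r in reads]          # start from single nucleotides
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))      # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent pair
        merges.append(a + b)
        for s in seqs:                       # apply the merge in every read
            i, out = 0, []
            while i < len(s):
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(a + b); i += 2
                else:
                    out.append(s[i]); i += 1
            s[:] = out
    return merges, seqs

reads = ["ACGTACGT", "ACGTTACG", "TTACGTAC"]
merges, segmented = learn_merges(reads, n_merges=4)
features = [Counter(s) for s in segmented]   # per-read segment counts
print(merges, segmented[0])
```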


2020 ◽  
Vol 36 (18) ◽  
pp. 4675-4681 ◽  
Author(s):  
Yuansheng Liu ◽  
Limsoon Wong ◽  
Jinyan Li

Abstract Motivation A maximal match between two genomes is a contiguous, non-extendable subsequence common to the two genomes. DNA bases mutate frequently from the genome of one individual to another. When a mutation occurs within a maximal match, it breaks the maximal match into shorter match segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match which is allowed to contain mutations. Results We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches, forming long MCMs. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes by 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. Availability and implementation https://github.com/yuansliu/memRGC. Supplementary information Supplementary data are available at Bioinformatics online.
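A toy sketch of the mutation-containing match idea: starting from an exact anchor match, extend rightward while tolerating isolated mismatches, so one long MCM is encoded instead of several short broken segments. The anchor detection and coprime double-window k-mer sampling of the actual algorithm are not reproduced here.

```python
# Extend an exact match ref[r:r+length] == tgt[t:t+length] across isolated
# mutations, returning the MCM length and the mutation positions to encode.
def extend_across_mutations(ref, tgt, r, t, length, max_mismatches=3):
    mismatches = []
    i = length
    while r + i < len(ref) and t + i < len(tgt) and len(mismatches) <= max_mismatches:
        if ref[r + i] != tgt[t + i]:
            mismatches.append(t + i)       # record mutation position in target
        i += 1
    # trim trailing mismatches so the MCM ends on a matching base
    while mismatches and mismatches[-1] == t + i - 1:
        mismatches.pop(); i -= 1
    return i, mismatches

ref = "ACGTACGTTTGACCGT"
tgt = "ACGTACGATTGACCGT"                   # one substitution at offset 7
mcm_len, muts = extend_across_mutations(ref, tgt, 0, 0, 4)
print(mcm_len, muts)                       # 16, [7]
```

Encoding the single (position, base) pair for the mutation is far cheaper than encoding two separate short matches with their offsets, which is the coding-cost argument the abstract makes.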


2019 ◽  
Vol 36 (1) ◽  
pp. 295-302 ◽  
Author(s):  
Leon Weber ◽  
Jannes Münchmeyer ◽  
Tim Rocktäschel ◽  
Maryam Habibi ◽  
Ulf Leser

Abstract Motivation Several recent studies showed that the application of deep neural networks advanced the state of the art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depends on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold standard corpora. Results We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented with both supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5–13 pp on the entity types chemicals, species and genes. Availability and implementation HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future. Supplementary information Supplementary data are available at Bioinformatics online.
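The pretrain-then-fine-tune schedule can be sketched with a BiLSTM token tagger in PyTorch; the CRF output layer of the actual LSTM-CRF is omitted for brevity, random tensors stand in for tokenized corpora, and all names and hyperparameters are illustrative rather than HUNER's.

```python
# Two-phase training: long pretraining on pooled corpora, short fine-tuning
# on the target corpus at a lower learning rate.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab=30_000, embed_dim=100, hidden=256, n_tags=3):  # IOB tags
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, tokens):                 # (batch, seq_len) -> per-token logits
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)

def toy_batches(n_batches=4, batch=8, seq=20, vocab=30_000, n_tags=3):
    # random stand-ins for tokenized, IOB-tagged sentences
    return [(torch.randint(0, vocab, (batch, seq)),
             torch.randint(0, n_tags, (batch, seq))) for _ in range(n_batches)]

def run_epochs(model, batches, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens, tags in batches:
            opt.zero_grad()
            logits = model(tokens)
            loss_fn(logits.reshape(-1, logits.size(-1)), tags.reshape(-1)).backward()
            opt.step()

model = BiLSTMTagger()
run_epochs(model, toy_batches(), lr=1e-3, epochs=20)            # phase 1: pooled pretraining
run_epochs(model, toy_batches(n_batches=1), lr=1e-4, epochs=3)  # phase 2: fine-tuning
```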


2019 ◽  
Vol 35 (21) ◽  
pp. 4445-4447 ◽  
Author(s):  
Roberto Semeraro ◽  
Alberto Magi

Abstract Motivation The recent technological improvement of Oxford Nanopore sequencing has pushed the throughput of these devices to 10–20 Gb, allowing the generation of millions of reads. For these reasons, the availability of fast software packages for evaluating experimental quality by generating highly informative and interactive summary plots is of fundamental importance. Results We developed PyPore, a three-module Python toolbox designed to handle raw FAST5 files from quality checking through alignment to a reference genome, and to explore their features through the generation of browsable HTML files. The first module provides an interface to explore and evaluate the information contained in FAST5 files and summarize it into informative quality measures. The second module converts raw data into FASTQ format, while the third module provides easy access to three state-of-the-art aligners and collects mapping statistics. Availability and implementation PyPore is open-source software written in Python 2.7; the source code is freely available, for all OS platforms, on GitHub at https://github.com/rsemeraro/PyPore. Supplementary information Supplementary data are available at Bioinformatics online.
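For readers unfamiliar with the FAST5 container, the sketch below shows the kind of inspection PyPore's first module performs, using h5py to walk the HDF5 layout. The group structure shown follows the multi-read FAST5 convention and may differ between device and software versions; the filename is a placeholder.

```python
# Summarize each read's raw signal in a multi-read FAST5 (HDF5) file.
import h5py
import numpy as np

def summarize_fast5(path):
    with h5py.File(path, "r") as f:
        for read_id in f.keys():               # one group per read, e.g. 'read_xxxx'
            signal = f[read_id]["Raw"]["Signal"][:]
            yield read_id, len(signal), float(np.mean(signal)), float(np.std(signal))

for read_id, n, mean, std in summarize_fast5("example.fast5"):  # placeholder path
    print(f"{read_id}: {n} samples, mean={mean:.1f}, std={std:.1f}")
```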


Sensors ◽  
2019 ◽  
Vol 19 (4) ◽  
pp. 810 ◽  
Author(s):  
Erzhuo Che ◽  
Jaehoon Jung ◽  
Michael Olsen

Mobile Laser Scanning (MLS) is a versatile remote sensing technology based on Light Detection and Ranging (lidar) that has been utilized for a wide range of applications. Several previous reviews focusing on the applications or characteristics of these systems exist in the literature; however, the many innovative data processing strategies described in the literature have not been reviewed in sufficient depth. To this end, we review and summarize the state of the art for MLS data processing approaches, including feature extraction, segmentation, object recognition and classification. In this review, we first discuss the impact of the scene type on the development of an MLS data processing method. Then, where appropriate, we describe relevant generalized algorithms for feature extraction and segmentation that are applicable to, and implemented in, many processing approaches. The methods for object recognition and point cloud classification are further reviewed, covering both the general concepts and technical details. In addition, available benchmark datasets for object recognition and classification are summarized. Further, the current limitations and challenges that a significant portion of point cloud processing techniques face are discussed. This review concludes with our outlook on the trends and opportunities of MLS data processing algorithms and applications.
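As an example of the generalized feature-extraction step discussed in the review, the sketch below estimates per-point normals and planarity from the eigen-decomposition of local covariance, a building block common to many MLS segmentation and classification pipelines. This is a generic technique, not tied to any specific method surveyed.

```python
# PCA-based local geometric features: the smallest-eigenvalue axis of the
# neighbourhood covariance approximates the surface normal; the eigenvalue
# spread yields a planarity score used to separate surfaces from clutter.
import numpy as np
from scipy.spatial import cKDTree

def local_features(points, k=16):
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)            # k nearest neighbours per point
    normals, planarity = [], []
    for nbrs in points[idx]:                    # (k, 3) neighbourhood
        cov = np.cov(nbrs.T)
        w, v = np.linalg.eigh(cov)              # eigenvalues in ascending order
        normals.append(v[:, 0])                 # normal = smallest-eigenvalue axis
        planarity.append((w[1] - w[0]) / w[2])  # close to 1 for planar patches
    return np.array(normals), np.array(planarity)

pts = np.random.rand(1000, 3) * [10, 10, 0.05]  # a roughly planar toy cloud
normals, planarity = local_features(pts)
print(planarity.mean())                         # high, since the cloud is near-planar
```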


Metals ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 250
Author(s):  
Jiří Hájek ◽  
Zaneta Dlouha ◽  
Vojtěch Průcha

This article is a response to the state of the art in monitoring the cooling capacity of quenching oils in industrial practice. Very often, a hardening shop requires a report with data on the cooling process for a particular quenching oil, but the interpretation of the data can be rather difficult. The main goal of our work was to compare the various criteria used for evaluating quenching oils; those that prove essential for operation in hardening shops would then be introduced into practice. Furthermore, the article describes monitoring the changes in the properties of a quenching oil used in a hardening shop, the effects of quenching oil temperature on its cooling capacity, and the impact of water content on certain cooling parameters of selected oils. Cooling curves were measured (including cooling rates and the times to reach relevant temperatures) according to ISO 9950. The hardening power of the oil and the area below the cooling-rate curve as a function of temperature (the amount of heat removed in the nose region of the continuous cooling transformation (CCT) curve) were calculated. V-values based on the work of Tamura, reflecting the steel type and its CCT curve, were calculated as well. All the data were compared against the hardness and microstructure on a section through a cylinder made of EN C35 steel cooled in the particular oil. Based on the results, criteria are recommended for assessing the suitability of a quenching oil for a specific steel grade and product size. The quenching oils used in the experiment were Houghto Quench C120, Paramo TK 22, Paramo TK 46, CS Noro MO 46 and Durixol W72.
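The cooling-curve criteria can be made concrete with a short sketch: from a temperature-time record such as an ISO 9950 probe test produces, compute the cooling-rate curve, the time to reach a target temperature, and the area below the cooling-rate curve between two temperatures. The data here are synthetic stand-ins, not measurements from the oils studied.

```python
# Derive cooling-curve criteria from a temperature-time record.
import numpy as np

t = np.linspace(0, 60, 6001)                 # time [s]
T = 30 + 820 * np.exp(-t / 12)               # synthetic cooling curve [degC]

rate = -np.gradient(T, t)                    # cooling rate [degC/s], positive

def time_to_reach(temp):
    return t[np.argmax(T <= temp)]           # first time T drops to temp

def area_between(T_hi, T_lo):
    """Area under the cooling-rate-vs-temperature curve between two temps."""
    mask = (T <= T_hi) & (T >= T_lo)
    # T decreases along the record, so flip the sign of the integral
    return -np.trapz(rate[mask], T[mask])

print(f"t(300 degC) = {time_to_reach(300):.1f} s")
print(f"area 600->400 degC = {area_between(600, 400):.0f} (degC/s)*degC")
```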


Author(s):  
Florian Kuisat ◽  
Fernando Lasagni ◽  
Andrés Fabián Lasagni

Abstract It is well known that the surface topography of a part can affect its mechanical performance, a typical concern in additive manufacturing. In this context, we report on the surface modification of additively manufactured components made of Titanium 64 (Ti64) and Scalmalloy®, using a pulsed laser, with the aim of reducing their surface roughness. In our experiments, a nanosecond-pulsed infrared laser source with variable pulse durations between 8 and 200 ns was applied. The impact of varying a large number of parameters on the surface quality of the smoothed areas was investigated. The results demonstrated a reduction of the surface roughness Sa by more than 80% for Titanium 64 and by 65% for Scalmalloy® samples. This extends the applicability of additively manufactured components beyond the current state of the art and breaks new ground for use in various industrial fields such as aerospace.
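The roughness figure quoted is the areal parameter Sa (arithmetical mean height). A minimal sketch of its computation from a height map follows, with synthetic surfaces standing in for profilometer scans; a real Sa evaluation additionally removes surface form before averaging.

```python
# Sa = arithmetical mean of absolute height deviations from the mean plane.
import numpy as np

def surface_sa(z: np.ndarray) -> float:
    return float(np.mean(np.abs(z - z.mean())))

rng = np.random.default_rng(0)
z_before = rng.normal(0.0, 10.0, size=(512, 512))   # rough as-built surface [um]
z_after = rng.normal(0.0, 2.0, size=(512, 512))     # laser-smoothed surface [um]
reduction = 1 - surface_sa(z_after) / surface_sa(z_before)
print(f"Sa reduction: {reduction:.0%}")              # ~80% for these synthetic maps
```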


2021 ◽  
Vol 16 (1) ◽  
pp. 1-23
Author(s):  
Min-Ling Zhang ◽  
Jun-Peng Fang ◽  
Yi-Bo Wang

In multi-label classification, the task is to induce predictive models which can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced from its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, the predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for the unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.
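A conceptual sketch of the BiLabel-specific features strategy follows: for each pair of labels, tailored features are built from distances to per-pair prototypes (k-means centroids stand in for the paper's heuristic prototype selection), a classifier is trained on the four joint outcomes of the pair, and pairwise votes are ensembled into per-label predictions. Names and design choices here are illustrative, not the authors' implementation.

```python
# Pairwise label-specific features with an ensemble of pair classifiers.
import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = (X[:, :3] + 0.3 * rng.normal(size=(200, 3)) > 0).astype(int)  # 3 labels

def pair_features(X, protos):
    return euclidean_distances(X, protos)       # tailored feature = distances to prototypes

models = {}
for j, k in itertools.combinations(range(Y.shape[1]), 2):
    joint = Y[:, j] * 2 + Y[:, k]               # 4 joint outcomes of the pair
    protos = np.vstack([                        # per-pair prototypes via k-means
        KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[joint == c]).cluster_centers_
        for c in np.unique(joint)])
    clf = LogisticRegression(max_iter=1000).fit(pair_features(X, protos), joint)
    models[(j, k)] = (protos, clf)

def predict(x):
    votes, counts = np.zeros(Y.shape[1]), np.zeros(Y.shape[1])
    for (j, k), (protos, clf) in models.items():
        joint = clf.predict(pair_features(x[None, :], protos))[0]
        votes[j] += joint // 2; votes[k] += joint % 2   # decode the pair's bits
        counts[j] += 1; counts[k] += 1
    return (votes / counts >= 0.5).astype(int)  # majority vote per label

print(predict(X[0]), Y[0])
```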

