Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lior Galanti ◽  
Dennis Shasha ◽  
Kristin C. Gunsalus

Abstract Background Systems biology increasingly relies on deep sequencing with combinatorial index tags to associate biological sequences with their sample, cell, or molecule of origin. Accurate data interpretation depends on the ability to classify sequences based on correct decoding of these combinatorial barcodes. The probability of correct decoding is influenced by both sequence quality and the number and arrangement of barcodes. The rising complexity of experimental designs calls for a probability model that accounts for both sequencing errors and random noise, generalizes to multiple combinatorial tags, and can handle any barcoding scheme. The needs for reproducibility and community benchmark standards demand a peer-reviewed tool that preserves decoding quality scores and provides tunable control over classification confidence that balances precision and recall. Moreover, continuous improvements in sequencing throughput require a fast, parallelized and scalable implementation. Results and discussion We developed a flexible, robustly engineered software package that performs probabilistic decoding and supports arbitrarily complex barcoding designs. Pheniqs computes the full posterior decoding error probability of observed barcodes by consulting basecalling quality scores and prior distributions, and reports sequences and confidence scores in Sequence Alignment/Map (SAM) fields. The product of posteriors for multiple independent barcodes provides an overall confidence score for each read. Pheniqs achieves greater accuracy than minimum edit distance or simple maximum likelihood estimation, and it scales linearly with core count, enabling classification of > 11 billion reads in 1 h 15 min using < 50 megabytes of memory. Pheniqs has been in production use for seven years in our genomics core facility. Conclusions We introduce computationally efficient software that implements both probabilistic and minimum distance decoders and show that decoding barcodes using posterior probabilities is more accurate than available methods. Pheniqs allows fine-tuning of decoding sensitivity using intuitive confidence thresholds and is extensible with alternative decoders and new error models. Any arbitrary arrangement of barcodes is easily configured, enabling computation of combinatorial confidence scores for any barcoding strategy. An optimized multithreaded implementation ensures that Pheniqs is faster and scales better with complex barcode sets than existing tools. Support for POSIX streams and multiple sequencing formats enables easy integration with automated analysis pipelines.
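
The posterior computation described here can be illustrated with a short sketch. The following Python snippet is illustrative only, not Pheniqs's actual C++ implementation; the barcode set, priors, noise prior, and confidence threshold are made-up values. It scores one observed barcode against a candidate set using Phred quality scores, a uniform random-noise likelihood of (1/4)^n, and a tunable confidence threshold:

```python
import math

def phred_to_error(q: str) -> list[float]:
    """Convert a Phred+33 quality string to per-base error probabilities."""
    return [10 ** (-(ord(c) - 33) / 10) for c in q]

def barcode_likelihood(observed: str, candidate: str, err: list[float]) -> float:
    """P(observed | candidate) under an independent per-base error model:
    a correct call has probability 1 - e, each of the 3 wrong bases e / 3."""
    p = 1.0
    for o, c, e in zip(observed, candidate, err):
        p *= (1.0 - e) if o == c else e / 3.0
    return p

def classify(observed: str, quality: str, barcodes: dict[str, float],
             noise_prior: float = 0.01, confidence: float = 0.95):
    """Return (best_barcode, posterior), or (None, posterior) when the read
    is better explained by noise or falls below the confidence threshold.
    `barcodes` maps candidate sequences to their prior probabilities."""
    err = phred_to_error(quality)
    noise_lik = 0.25 ** len(observed)  # random sequences are uniform: (1/4)^n
    scores = {b: prior * barcode_likelihood(observed, b, err)
              for b, prior in barcodes.items()}
    total = sum(scores.values()) + noise_prior * noise_lik
    best = max(scores, key=scores.get)
    posterior = scores[best] / total
    return (best, posterior) if posterior >= confidence else (None, posterior)

# Example: two hypothetical sample barcodes with equal priors.
print(classify("ACGT", "IIII", {"ACGT": 0.495, "TGCA": 0.495}))
```

Multiplying such per-barcode posteriors across independent barcodes yields the combinatorial confidence score the abstract mentions.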


10.2196/16843 ◽  
2020 ◽  
Vol 9 (7) ◽  
pp. e16843
Author(s):  
Hee Kim ◽  
Thomas Ganslandt ◽  
Thomas Miethke ◽  
Michael Neumaier ◽  
Maximilian Kittel

Background In recent years, remarkable progress has been made in deep learning technology, and successful use cases have been introduced in the medical domain. However, few studies have considered high-performance computing to fully exploit the capability of deep learning technology. Objective This paper aims to design a solution to accelerate automated Gram stain image interpretation by means of a deep learning framework without additional hardware resources. Methods We will apply and evaluate 3 methodologies, namely fine-tuning, an integer arithmetic–only framework, and hyperparameter tuning. Results The choice of pretrained models and the ideal settings for layer tuning and hyperparameter tuning will be determined. These results will provide an empirical yet reproducible guideline for those who consider a rapid deep learning solution for Gram stain image interpretation. The results are planned to be announced in the first quarter of 2021. Conclusions Making a balanced decision between modeling performance and computational performance is the key to a successful deep learning solution; otherwise, highly accurate but slow deep learning solutions add little value to routine care. International Registered Report Identifier (IRRID) DERR1-10.2196/16843
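
The integer arithmetic–only framework named in the protocol is essentially inference-time quantization. As one hedged illustration (not the study's pipeline; the model choice and dtype here are assumptions), PyTorch's dynamic quantization converts the fully connected layers of a pretrained network to int8 arithmetic without extra hardware:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Pretrained CNN as a stand-in classifier; the study's actual model choice
# is to be determined by its experiments.
model = models.alexnet(pretrained=True).eval()

# Dynamic quantization stores Linear weights as int8 and quantizes
# activations on the fly, trading a little accuracy for faster inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 3, 224, 224)  # dummy input at ImageNet resolution
with torch.no_grad():
    print(quantized(x).shape)    # same output shape, lower-precision math
```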


Geosciences ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 60
Author(s):  
Viacheslav Glinskikh ◽  
Oleg Nechaev ◽  
Igor Mikhaylov ◽  
Kirill Danilovskiy ◽  
Vladimir Olenchenko

This paper addresses the topical problem of examining the state of permafrost and the processes of its geocryological changes by means of geophysical methods. To monitor the cryolithozone, we proposed and scientifically substantiated a new technique of pulsed electromagnetic cross-well sounding. Based on the vector finite-element method, we created a mathematical model of the cross-well sounding process with a pulsed source in a three-dimensional, spatially heterogeneous medium. A high-performance parallel computing algorithm was developed and verified. Using realistic geoelectric models of permafrost with a talik under a highway, constructed from the interpretation of electrotomography field data, we numerically simulated the pulsed sounding on the computing resources of the Siberian Supercomputer Center of SB RAS. The simulation results suggest that the proposed system of pulsed electromagnetic cross-well monitoring is highly sensitive to the presence and dimensions of the talik. The devised approach can address a wide range of issues related to monitoring permafrost beneath civil and industrial facilities, buildings, and other structures.
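
The abstract does not give the mathematical formulation, but time-domain vector finite-element modeling of pulsed electromagnetic sources commonly solves the quasi-static curl-curl equation for the electric field; stating it here is an assumption about the class of model used, not the authors' exact system:

```latex
\nabla \times \left( \mu^{-1}\, \nabla \times \mathbf{E} \right)
  + \sigma \, \frac{\partial \mathbf{E}}{\partial t}
  = - \frac{\partial \mathbf{J}_s}{\partial t}
```

Here \sigma is the electrical conductivity, \mu the magnetic permeability, and \mathbf{J}_s the source current density. Edge (Nédélec) elements preserve the tangential continuity of \mathbf{E} across material interfaces, which is what makes the vector finite-element method well suited to the strong conductivity contrasts of permafrost with taliks.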


2020 ◽  
Vol 70 (1) ◽  
pp. 181-189
Author(s):  
Guy Baele ◽  
Mandev S Gill ◽  
Paul Bastide ◽  
Philippe Lemey ◽  
Marc A Suchard

Abstract Markov models of character substitution on phylogenies form the foundation of phylogenetic inference frameworks. Early models made the simplifying assumption that the substitution process is homogeneous over time and across sites in the molecular sequence alignment. While standard practice adopts extensions that accommodate heterogeneity of substitution rates across sites, heterogeneity in the process over time in a site-specific manner remains frequently overlooked. This is problematic, as evolutionary processes that act at the molecular level are highly variable, subjecting different sites to different selective constraints over time, impacting their substitution behavior. We propose incorporating time variability through Markov-modulated models (MMMs), which extend covarion-like models and allow the substitution process (including relative character exchange rates as well as the overall substitution rate) at individual sites to vary across lineages. We implement a general MMM framework in BEAST, a popular Bayesian phylogenetic inference software package, allowing researchers to compose a wide range of MMMs through flexible XML specification. Using examples from bacterial, viral, and plastid genome evolution, we show that MMMs impact phylogenetic tree estimation and can substantially improve model fit compared to standard substitution models. Through simulations, we show that marginal likelihood estimation accurately identifies the generative model and does not systematically prefer the more parameter-rich MMMs. To mitigate the increased computational demands associated with MMMs, our implementation exploits recent developments in BEAGLE, a high-performance computational library for phylogenetic inference. [Bayesian inference; BEAGLE; BEAST; covarion; heterotachy; Markov-modulated models; phylogenetics.]
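
The structure of such a model is easy to sketch. In a covarion-like MMM, each site carries a hidden rate class, and the generator on the product space (class, nucleotide) combines a per-class scaled substitution matrix with a switching process among classes. Below is a minimal numpy sketch, illustrative only; BEAST's implementation and parameterization differ:

```python
import numpy as np

def mmm_generator(S: np.ndarray, rates: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Generator of a Markov-modulated model on the (rate class, nucleotide)
    product space:
      S     -- 4x4 substitution rate matrix, rows summing to 0 (e.g., Jukes-Cantor)
      rates -- length-m vector of per-class rate multipliers
      A     -- m x m switching-rate matrix among classes, rows summing to 0
    Sites substitute within a class at the scaled rate and hop between
    classes along A, giving the covarion-like behavior described above."""
    return np.kron(np.diag(rates), S) + np.kron(A, np.eye(S.shape[0]))

# Two-class covarion-like example: an "off" class (rate 0) and an "on" class.
S = np.full((4, 4), 1.0 / 3.0)
np.fill_diagonal(S, -1.0)                      # Jukes-Cantor generator
A = np.array([[-0.1, 0.1],
              [0.1, -0.1]])                    # slow on/off switching
Q = mmm_generator(S, np.array([0.0, 1.0]), A)
print(Q.shape, np.allclose(Q.sum(axis=1), 0))  # (8, 8) True: a valid generator
```

Transition probabilities over a branch of length t then follow as the matrix exponential expm(Q * t), exactly as for an ordinary substitution model but on the enlarged state space.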


Author(s):  
Zezhou Zhang ◽  
Qingze Zou

Abstract In this paper, an optimal data-driven modeling-free differential-inversion-based iterative control (OMFDIIC) method is proposed to achieve both high performance and robustness in the presence of random disturbances. Achieving high accuracy and fast convergence is challenging because the system dynamics vary under external uncertainties and the system bandwidth is limited. The aim of the proposed method is to compensate for the dynamics effect without a modeling process and to achieve both high accuracy and robust convergence, by extending the existing modeling-free differential-inversion-based iterative control (MFDIIC) method with a frequency- and iteration-dependent gain. The convergence of the OMFDIIC method is analyzed with random noise/disturbances considered. The developed method is applied to a wafer stage and shows a significant improvement in performance.
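
The paper itself derives an optimal iteration- and frequency-dependent gain; as a rough conceptual sketch of the underlying modeling-free inversion-based update (with a fixed gain standing in for the optimized one, and a toy low-pass plant in place of the wafer stage), one iteration in the frequency domain looks like this:

```python
import numpy as np

def mfdiic_update(u_k, y_k, y_d, alpha, eps=1e-9):
    """One frequency-domain iteration of a modeling-free inversion-based
    iterative control law (a conceptual sketch, not the paper's exact OMFDIIC).

    u_k, y_k : FFTs of the current input and the measured output
    y_d      : FFT of the desired output
    alpha    : frequency-dependent gain in (0, 1]; the paper optimizes an
               iteration- and frequency-dependent gain, fixed here."""
    g_inv = u_k / (y_k + eps)          # data-driven estimate of inverse dynamics
    return (1 - alpha) * u_k + alpha * g_inv * y_d

# Toy plant: a first-order low-pass filter applied in the frequency domain.
n = 256
t = np.arange(n)
y_d = np.fft.fft(np.sin(2 * np.pi * 4 * t / n))        # desired trajectory
H = 1.0 / (1.0 + 1j * np.fft.fftfreq(n) * 50)          # dynamics, unknown to us
u = y_d.copy()                                         # initial input guess
for _ in range(10):
    y = H * u + np.fft.fft(1e-3 * np.random.randn(n))  # output with random noise
    u = mfdiic_update(u, y, y_d, alpha=0.8)
print(np.abs(np.fft.ifft(H * u - y_d)).max())          # small tracking error
```

The key property, mirrored in the sketch, is that no plant model is ever built: the ratio of past input to past output stands in for the inverse dynamics, and the gain trades convergence speed against noise amplification.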


2020 ◽  
Author(s):  
Tuan Pham

Chest X-rays have been found to be very promising for assessing COVID-19 patients, especially for resolving emergency-department and urgent-care-center overcapacity. Deep-learning (DL) methods in artificial intelligence (AI) play a dominant role as high-performance classifiers in the detection of the disease using chest X-rays. While many new DL models have been developed for this purpose, this study aimed to investigate the fine-tuning of pretrained convolutional neural networks (CNNs) for the classification of COVID-19 using chest X-rays. Three pretrained CNNs, AlexNet, GoogleNet, and SqueezeNet, were selected and fine-tuned without data augmentation to carry out 2-class and 3-class classification tasks using 3 public chest X-ray databases. In comparison with other recently developed DL models, the 3 pretrained CNNs achieved very high classification results in terms of accuracy, sensitivity, specificity, precision, F1 score, and area under the receiver-operating-characteristic curve. AlexNet, GoogleNet, and SqueezeNet require the least training time among pretrained DL models, and with suitable selection of training parameters these networks can achieve excellent classification results without data augmentation. The findings address the urgent need to contain the pandemic by facilitating the deployment of AI tools that are fully automated and readily available in the public domain for rapid implementation.
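
A minimal sketch of this kind of fine-tuning setup follows, under stated assumptions: PyTorch/torchvision, a 3-class task, and arbitrary training hyperparameters; the paper's exact settings are not reproduced here. The pretrained backbone is kept and only the final classifier layer is swapped for the new label set:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Adapt a pretrained AlexNet to a hypothetical 3-class chest X-ray task by
# replacing its final fully connected layer (index 6 of the classifier).
model = models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 3)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

x = torch.randn(8, 3, 224, 224)          # stand-in batch of X-ray images
labels = torch.randint(0, 3, (8,))       # stand-in labels
optimizer.zero_grad()
loss = criterion(model(x), labels)       # one fine-tuning step
loss.backward()
optimizer.step()
print(float(loss))
```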


2018 ◽  
Author(s):  
Jérémie Decouchant ◽  
Maria Fernandes ◽  
Marcus Völp ◽  
Francisco M Couto ◽  
Paulo Esteves-Veríssimo

Abstract Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if not protected to the highest standards. In this article, we argue that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data is produced. We show that a previous approach for filtering short reads cannot extend to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied indiscriminately to reads of any length, making it usable with any recent or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides per genome remain undetected, instead of 100,000 in previous works). It produces far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56%, at a 2% mutation rate). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.
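
A hedged sketch of the idea of a length-agnostic filter follows; this is not the authors' actual classifier, and the k-mer length and index structure are illustrative choices. A read of any length is flagged as privacy-sensitive if it shares a k-mer with an index built from known sensitive regions, which is what makes the filter indifferent to read length:

```python
def kmers(seq: str, k: int):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_sensitive_index(sensitive_seqs, k=24):
    """Index k-mers drawn from known privacy-sensitive regions
    (an illustrative stand-in for a real dictionary of sensitive sites)."""
    index = set()
    for s in sensitive_seqs:
        index.update(kmers(s, k))
    return index

def is_sensitive(read: str, index, k=24) -> bool:
    """Flag a read of ANY length that shares a k-mer with the index,
    so short and long reads are handled by the same code path."""
    return any(km in index for km in kmers(read, k))

idx = build_sensitive_index(["ACGTACGTACGTACGTACGTACGTGG"], k=24)
print(is_sensitive("TTACGTACGTACGTACGTACGTACGTGGTT", idx))  # True
```

Tolerance to sequencing errors, as in the paper, would additionally require approximate rather than exact k-mer matching.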


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Yifan Sun ◽  
Xiang Xu

As a widely used inertial device, a MEMS triaxial accelerometer suffers from zero-bias error, nonorthogonality error, and scale-factor error due to technical defects. Raw readings without calibration might seriously affect the accuracy of an inertial navigation system. Therefore, it is necessary to calibrate a MEMS triaxial accelerometer before use. This paper presents a MEMS triaxial accelerometer calibration method based on maximum likelihood estimation. An error model of the MEMS triaxial accelerometer is formulated, and the optimal estimation function is established. The calibration parameters are obtained by the Newton iteration method, which is efficient and accurate. Compared with the least squares method, which estimates the parameters of a suboptimal estimation function established under the assumption that the random noise has zero mean, the parameters calibrated by the maximum likelihood estimation method are more accurate and stable. Moreover, the proposed method has a low computational cost, which makes it more practical. Simulation and experimental results using a consumer low-cost MEMS triaxial accelerometer are presented to support these advantages of the maximum likelihood estimation method. The proposed method has the potential to be applied to other triaxial inertial sensors.
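
The gravity-norm constraint behind such calibration is easy to sketch: in any static orientation, the calibrated acceleration K(a - b) must have magnitude g. The snippet below is a sketch under assumptions (an upper-triangular scale/nonorthogonality matrix K, synthetic data, and scipy's Gauss-Newton-style least_squares standing in for the paper's Newton iteration); it recovers K and the bias b from static readings:

```python
import numpy as np
from scipy.optimize import least_squares

G = 9.80665  # local gravity magnitude (m/s^2)

def unpack(p):
    """p = [k11,k12,k13,k22,k23,k33,b1,b2,b3] -> upper-triangular K, bias b."""
    K = np.array([[p[0], p[1], p[2]],
                  [0.0,  p[3], p[4]],
                  [0.0,  0.0,  p[5]]])
    return K, p[6:]

def residuals(p, readings):
    """At rest, the calibrated acceleration K (a - b) must have norm G."""
    K, b = unpack(p)
    return np.linalg.norm((readings - b) @ K.T, axis=1) - G

# Synthetic static readings in many orientations, with bias and scale errors.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(50, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
true_K = np.array([[1.02, 0.01, 0.00],
                   [0.00, 0.98, 0.02],
                   [0.00, 0.00, 1.01]])
true_b = np.array([0.05, -0.03, 0.02])
readings = (G * dirs) @ np.linalg.inv(true_K).T + true_b
readings += 1e-3 * rng.normal(size=readings.shape)     # sensor noise

p0 = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0], dtype=float)
fit = least_squares(residuals, p0, args=(readings,))   # Gauss-Newton-style solve
K_hat, b_hat = unpack(fit.x)
print(np.round(K_hat, 3), np.round(b_hat, 3))          # close to true_K, true_b
```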


2011 ◽  
Vol 21 (03) ◽  
pp. 279-299 ◽  
Author(s):  
I-HSIN CHUNG ◽  
CHE-RUNG LEE ◽  
JIAZHENG ZHOU ◽  
YEH-CHING CHUNG

As high performance computing systems scale up, mapping the tasks of a parallel application onto physical processors to allow efficient communication becomes one of the critical performance issues. Existing algorithms were usually designed to map applications with regular communication patterns. Their mapping criteria usually overlook the size of communicated messages, which is the primary factor in communication time. In addition, most of their time complexities are too high to process large-scale problems. In this paper, we present a hierarchical mapping algorithm (HMA) that is capable of mapping applications with irregular communication patterns. It first partitions tasks according to their run-time communication information: tasks that communicate with each other more frequently are regarded as strongly connected and, based on their connectivity strength, are partitioned into supernodes using algorithms from spectral graph theory. The hierarchical partitioning reduces the complexity of the mapping algorithm and thereby achieves scalability. Finally, the run-time communication information is used again in a fine-tuning step to explore better mappings. Our experiments show that the mapping algorithm reduces point-to-point communication time by up to 20% for PDGEMM, a ScaLAPACK matrix multiplication kernel, and by up to 7% for AMG2006, a tier 1 application of the Sequoia benchmark.
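
The spectral step is the heart of the supernode construction. A minimal sketch follows (illustrative, not the paper's HMA code; the task graph is made up): build a weighted communication graph from run-time message volumes and split tasks by the sign of the Fiedler vector of its Laplacian; applying this recursively yields the hierarchy.

```python
import numpy as np

def spectral_bipartition(W: np.ndarray) -> np.ndarray:
    """Split tasks into two groups using the Fiedler vector of the weighted
    communication graph, where W[i, j] is the message volume between tasks
    i and j. This is the spectral step used to form supernodes."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                  # graph Laplacian
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]                # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= 0                 # boolean group labels

# Six tasks: {0,1,2} communicate heavily among themselves, as do {3,4,5},
# with only a weak link between the two clusters.
W = np.zeros((6, 6))
for i, j, vol in [(0, 1, 9), (1, 2, 8), (0, 2, 7),
                  (3, 4, 9), (4, 5, 8), (3, 5, 7), (2, 3, 1)]:
    W[i, j] = W[j, i] = vol
print(spectral_bipartition(W))  # groups strongly connected tasks together
```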

