PeakBot: Machine learning based chromatographic peak picking

AbstractMotivationChromatographic peak picking is among the first steps in software pipelines for processing LC-HRMS datasets in untargeted metabolomics applications. Its performance is crucial for the holistic detection of all metabolic features as well as their relative quantification for statistical analysis and metabolite identification. Unfortunately, random noise, non-baseline separated compounds and unspecific background signals complicate this task.ResultsA machine-learning framework entitled PeakBot was developed for detecting chromatographic peaks in LC-HRMS profile-mode data. It first detects all local signal maxima in a chromatogram, which are then extracted as super-sampled standardized areas (retention time vs. m/z). These are subsequently inspected by a custom-trained convolutional neural network that forms the basis of PeakBot’s architecture. The model reports if the respective local maximum is the apex of a chromatographic peak or not as well as its peak center and bounding box.In independent training and validation datasets used for development, PeakBot achieved a high performance with respect to discriminating between chromatographic peaks and background signals (F1 score of 0.99). A comparison of different sets of reference features showed that at least 100 reference features (including isotopologs) should be provided to achieve high-quality results for detecting new chromatographic peaks.PeakBot is implemented in Python (3.8) and uses the TensorFlow (2.4.1) package for machine-learning related tasks. It has been tested on Linux and Windows OSs.AvailabilityThe framework is available free of charge for non-commercial use (CC BY-NC-SA). It is available at https://github.com/christophuv/[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

Performance and scaling behavior of bioinformatic applications in virtualization environments to create awareness for the efficient use of compute resources

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009244 ◽

2021 ◽

Vol 17 (7) ◽

pp. e1009244

Author(s):

Maximilian Hanussek ◽

Felix Bartusch ◽

Jens Krüger

Keyword(s):

Machine Learning ◽

Virtual Environments ◽

High Performance ◽

Biological Data ◽

Scaling Behavior ◽

Bare Metal ◽

Learning Framework ◽

Speed Up ◽

Clustal Omega ◽

Performance Computing

The large amount of biological data available in the current times, makes it necessary to use tools and applications based on sophisticated and efficient algorithms, developed in the area of bioinformatics. Further, access to high performance computing resources is necessary, to achieve results in reasonable time. To speed up applications and utilize available compute resources as efficient as possible, software developers make use of parallelization mechanisms, like multithreading. Many of the available tools in bioinformatics offer multithreading capabilities, but more compute power is not always helpful. In this study we investigated the behavior of well-known applications in bioinformatics, regarding their performance in the terms of scaling, different virtual environments and different datasets with our benchmarking tool suite BOOTABLE. The tool suite includes the tools BBMap, Bowtie2, BWA, Velvet, IDBA, SPAdes, Clustal Omega, MAFFT, SINA and GROMACS. In addition we added an application using the machine learning framework TensorFlow. Machine learning is not directly part of bioinformatics but applied to many biological problems, especially in the context of medical images (X-ray photographs). The mentioned tools have been analyzed in two different virtual environments, a virtual machine environment based on the OpenStack cloud software and in a Docker environment. The gained performance values were compared to a bare-metal setup and among each other. The study reveals, that the used virtual environments produce an overhead in the range of seven to twenty-five percent compared to the bare-metal environment. The scaling measurements showed, that some of the analyzed tools do not benefit from using larger amounts of computing resources, whereas others showed an almost linear scaling behavior. The findings of this study have been generalized as far as possible and should help users to find the best amount of resources for their analysis. Further, the results provide valuable information for resource providers to handle their resources as efficiently as possible and raise the user community’s awareness of the efficient usage of computing resources.

Download Full-text

Machine learning framework for predicting failure mode and shear capacity of ultra high performance concrete beams

Engineering Structures ◽

10.1016/j.engstruct.2020.111221 ◽

2020 ◽

Vol 224 ◽

pp. 111221 ◽

Cited By ~ 2

Author(s):

Roya Solhmirzaei ◽

Hadi Salehi ◽

Venkatesh Kodur ◽

M.Z. Naser

Keyword(s):

Machine Learning ◽

Failure Mode ◽

High Performance ◽

High Performance Concrete ◽

Concrete Beams ◽

Shear Capacity ◽

Ultra High Performance Concrete ◽

Learning Framework

Download Full-text

Framework to select the top ML classifier for robust seizure detection and prediction: A comparison-based study using multiple preictal time and feature sets

10.21203/rs.3.rs-470605/v1 ◽

2021 ◽

Author(s):

Rajlakshmi Borthakur ◽

Amit Kumar ◽

Upendra Kumar Jena ◽

Dibya Jyoti Borthakur

Keyword(s):

Machine Learning ◽

High Performance ◽

False Positive Rate ◽

Onset Time ◽

Field Performance ◽

Seizure Detection ◽

Learning Framework ◽

Prediction Time ◽

Seizure Models ◽

Unseen Data

Abstract We present a machine learning framework aimed at increasing performance seizure detection systems. Despite the diversity of ML-based available, the methodology to select an optimal classifier or ensemble model for seizure research is not commonly known. Our study is aimed at bridging this gap by showing use of a statistically guided machine learning framework that has delivered reliable on-field performance. We empirically examined performances of 11 different classification algorithms, with additional 13 variants, executing 300 machine learning models against 6 fixed preictal windows (ranging from 5 minutes to 60 minutes) and 3 different sets of features as part of our study. Using a base of 55 subjects, of which we used 43 subjects to train and validate multiple ensemble models and classifiers, we achieved upto 98.3% F1 score, with 94.8% chance level performance on our data, and a False Positive Rate (FPR) of 0.0156/hr. We applied the top ensembles and classifier variants on 12 completely new subjects, unseen by the models to test their on-field efficacy. One of our models detected all 12 seizures on the unseen data and predicted 11 of them before electrographic onset. Average prediction time of models across all subjects was 31.56 minutes. At an individual level the earliest prediction was recorded at 176 minutes, which is the highest known seizure prediction time recorded in literature. Our analysis showed that a 30 minute window could be the most ideal preictal window for training seizure models. The significance of our work is that we are reporting the highest F1 score of 98.3% on seizure data so far recorded with sample size of higher than 10. We are able to show high performance of seizure models using wearables data, detecting seizures at par with neurologist verified EEG onset time, and predicting seizures up to 176 minutes before onset, which is significantly higher than previously reported scores. Our work lays important foundations for using wearables for actual monitoring of PWEs in out-of-hospital settings, which might be an important step in ensuring their safety and quality of life.

Download Full-text

Hexagonal Image Processing in the Context of Machine Learning: Conception of a Biologically Inspired Hexagonal Deep Learning Framework

2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA) ◽

10.1109/icmla.2019.00300 ◽

2019 ◽

Cited By ~ 1

Author(s):

Tobias Schlosser ◽

Michael Friedrich ◽

Danny Kowerko

Keyword(s):

Machine Learning ◽

Image Processing ◽

Deep Learning ◽

Biologically Inspired ◽

Learning Framework ◽

Learning Conception ◽

Hexagonal Image Processing

Download Full-text

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.

Download Full-text

Non-Intrusive Parametric Model Order Reduction with Error Correction Modeling for Changing Well Locations Using a Machine Learning Framework

10.2118/199042-ms ◽

2020 ◽

Author(s):

Hardikkumar Zalavadia ◽

Eduardo Gildin

Keyword(s):

Machine Learning ◽

Error Correction ◽

Model Order Reduction ◽

Order Reduction ◽

Parametric Model ◽

Model Order ◽

Parametric Model Order Reduction ◽

Learning Framework ◽

Error Correction Modeling

Download Full-text

Use of Machine Learning to Investigate the Quantitative Checklist for Autism in Toddlers (Q-CHAT) towards Early Autism Screening

Diagnostics ◽

10.3390/diagnostics11030574 ◽

2021 ◽

Vol 11 (3) ◽

pp. 574

Author(s):

Gennaro Tartarisco ◽

Giovanni Cicceri ◽

Davide Di Pietro ◽

Elisa Leonardi ◽

Stefania Aiello ◽

...

Keyword(s):

Machine Learning ◽

High Performance ◽

Behavioral Science ◽

Autistic Traits ◽

Classification Performance ◽

Recursive Feature Elimination ◽

Diagnostic Tools ◽

Support Vector ◽

K Nearest Neighbors ◽

Autism Screening

In the past two decades, several screening instruments were developed to detect toddlers who may be autistic both in clinical and unselected samples. Among others, the Quantitative CHecklist for Autism in Toddlers (Q-CHAT) is a quantitative and normally distributed measure of autistic traits that demonstrates good psychometric properties in different settings and cultures. Recently, machine learning (ML) has been applied to behavioral science to improve the classification performance of autism screening and diagnostic tools, but mainly in children, adolescents, and adults. In this study, we used ML to investigate the accuracy and reliability of the Q-CHAT in discriminating young autistic children from those without. Five different ML algorithms (random forest (RF), naïve Bayes (NB), support vector machine (SVM), logistic regression (LR), and K-nearest neighbors (KNN)) were applied to investigate the complete set of Q-CHAT items. Our results showed that ML achieved an overall accuracy of 90%, and the SVM was the most effective, being able to classify autism with 95% accuracy. Furthermore, using the SVM–recursive feature elimination (RFE) approach, we selected a subset of 14 items ensuring 91% accuracy, while 83% accuracy was obtained from the 3 best discriminating items in common to ours and the previously reported Q-CHAT-10. This evidence confirms the high performance and cross-cultural validity of the Q-CHAT, and supports the application of ML to create shorter and faster versions of the instrument, maintaining high classification accuracy, to be used as a quick, easy, and high-performance tool in primary-care settings.

Download Full-text

SCOUR: a stepwise machine learning framework for predicting metabolite-dependent regulatory interactions

BMC Bioinformatics ◽

10.1186/s12859-021-04281-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Justin Y. Lee ◽

Britney Nguyen ◽

Carlos Orosco ◽

Mark P. Styczynski

Keyword(s):

Machine Learning ◽

Metabolic Networks ◽

Sampling Frequency ◽

Low Noise ◽

Training Data ◽

High Noise ◽

Regulatory Interactions ◽

Learning Framework ◽

Metabolic Systems ◽

Noise Data

Abstract Background The topology of metabolic networks is both well-studied and remarkably well-conserved across many species. The regulation of these networks, however, is much more poorly characterized, though it is known to be divergent across organisms—two characteristics that make it difficult to model metabolic networks accurately. While many computational methods have been built to unravel transcriptional regulation, there have been few approaches developed for systems-scale analysis and study of metabolic regulation. Here, we present a stepwise machine learning framework that applies established algorithms to identify regulatory interactions in metabolic systems based on metabolic data: stepwise classification of unknown regulation, or SCOUR. Results We evaluated our framework on both noiseless and noisy data, using several models of varying sizes and topologies to show that our approach is generalizable. We found that, when testing on data under the most realistic conditions (low sampling frequency and high noise), SCOUR could identify reaction fluxes controlled only by the concentration of a single metabolite (its primary substrate) with high accuracy. The positive predictive value (PPV) for identifying reactions controlled by the concentration of two metabolites ranged from 32 to 88% for noiseless data, 9.2 to 49% for either low sampling frequency/low noise or high sampling frequency/high noise data, and 6.6–27% for low sampling frequency/high noise data, with results typically sufficiently high for lab validation to be a practical endeavor. While the PPVs for reactions controlled by three metabolites were lower, they were still in most cases significantly better than random classification. Conclusions SCOUR uses a novel approach to synthetically generate the training data needed to identify regulators of reaction fluxes in a given metabolic system, enabling metabolomics and fluxomics data to be leveraged for regulatory structure inference. By identifying and triaging the most likely candidate regulatory interactions, SCOUR can drastically reduce the amount of time needed to identify and experimentally validate metabolic regulatory interactions. As high-throughput experimental methods for testing these interactions are further developed, SCOUR will provide critical impact in the development of predictive metabolic models in new organisms and pathways.

Download Full-text