Increasing workflow development speed and reproducibility with Vectools

F1000Research ◽

10.12688/f1000research.16301.2 ◽

2018 ◽

Vol 7 ◽

pp. 1499

Author(s):

Tyler Weirick ◽

Raphael Müller ◽

Shizuka Uchida

Keyword(s):

Machine Learning ◽

Command Line ◽

Simple Machine ◽

Wide Range ◽

Command Line Tool ◽

Speed Up

Despite advances in bioinformatics, custom scripts remain a source of difficulty, slowing workflow development and hampering reproducibility. Here, we introduce Vectools, a command-line tool suite that reduces reliance on custom scripts and improves reproducibility by offering a wide range of common, easy-to-use functions for table and vector manipulation. Vectools also offers a number of vector-related functions to speed up workflow development, such as simple machine learning and common statistics functions.

Easy-Prime: a machine learning–based prime editor design tool

Genome Biology ◽

10.1186/s13059-021-02458-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yichao Li ◽

Jingjing Chen ◽

Shengdar Q. Tsai ◽

Yong Cheng

Keyword(s):

Machine Learning ◽

Genome Editing ◽

Rna Folding ◽

Design Tool ◽

Data Sources ◽

Published Data ◽

Command Line ◽

Wide Range ◽

Command Line Tool ◽

Folding Structure

Abstract Prime editing is a revolutionary genome-editing technology that can make a wide range of precise edits in DNA. However, designing highly efficient prime editors (PEs) remains challenging. We develop Easy-Prime, a machine learning–based program trained with multiple published data sources. Easy-Prime captures both known and novel features, such as RNA folding structure, and optimizes feature combinations to improve editing efficiency. We provide optimized PE design for installation of 89.5% of 152,351 GWAS variants. Easy-Prime is available both as a command line tool and an interactive PE design server at: http://easy-prime.cc/.

UniverSC: a flexible cross-platform single-cell data processing pipeline

10.21203/rs.3.rs-244461/v1 ◽

2021 ◽

Author(s):

S. Kelly ◽

Kai Battenberg ◽

Nicola Hetherington ◽

Makoto Hayashi ◽

Aki Minoda

Keyword(s):

Single Cell ◽

Sequencing Analysis ◽

Command Line ◽

Rna Molecules ◽

Processing Pipeline ◽

Wide Range ◽

Single Cell Rna Sequencing ◽

Command Line Tool ◽

Cross Platform ◽

Cell Data

Abstract Single-cell RNA-sequencing analysis to quantify RNA molecules in individual cells has become popular owing to the large amount of information one can obtain from each experiment. We have developed UniverSC (https://github.com/minoda-lab/universc), a universal single-cell processing tool that supports any UMI-based platform. Our command-line tool enables consistent and comprehensive integration, comparison, and evaluation across data generated from a wide range of platforms.

UniverSC: a flexible cross-platform single-cell data processing pipeline

10.1101/2021.01.19.427209 ◽

2021 ◽

Author(s):

S. Thomas Kelly ◽

Kai Battenberg ◽

Nicola A. Hetherington ◽

Makoto Hayashi ◽

Aki Minoda

Keyword(s):

Single Cell ◽

Sequencing Analysis ◽

Command Line ◽

Rna Molecules ◽

Processing Pipeline ◽

Wide Range ◽

Single Cell Rna Sequencing ◽

Command Line Tool ◽

Cross Platform ◽

Cell Data

Abstract Single-cell RNA-sequencing analysis to quantify RNA molecules in individual cells has become popular owing to the large amount of information one can obtain from each experiment. We have developed UniverSC (https://github.com/minoda-lab/universc), a universal single-cell processing tool that supports any UMI-based platform. Our command-line tool enables consistent and comprehensive integration, comparison, and evaluation across data generated from a wide range of platforms.

Accelerating the Design of Photocatalytic Surfaces for Antimicrobial Application: Machine Learning Based on a Sparse Dataset

Catalysts ◽

10.3390/catal11081001 ◽

2021 ◽

Vol 11 (8) ◽

pp. 1001

Author(s):

Heesoo Park ◽

El Tayeb Bentria ◽

Sami Rtimi ◽

Abdelilah Arredouani ◽

Halima Bensmail ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Trial And Error ◽

Public Places ◽

Antimicrobial Surfaces ◽

Design And Optimization ◽

Antimicrobial Materials ◽

Wide Range ◽

Speed Up ◽

The Cost

Nowadays, most experiments to synthesize and test photocatalytic antimicrobial materials are based on trial and error. More often than not, the mechanism of action of the antimicrobial activity is unknown for a large spectrum of microorganisms. Here, we propose a scheme to speed up the design and optimization of photocatalytic antimicrobial surfaces tailored to give a balanced production of reactive oxygen species (ROS) upon illumination. Using an experiment-to-machine-learning scheme applied to a limited experimental dataset, we built a model that can predict the photocatalytic activity of materials for antimicrobial applications over a wide range of material compositions. This machine-learning-assisted strategy offers the opportunity to reduce the cost, labor, time, and precursors consumed during experiments that are based on trial and error. Our strategy may significantly accelerate the large-scale deployment of photocatalysts as a promising route to mitigate fomite transmission of pathogens (bacteria, viruses, fungi) in hospital settings and public places.

Prediction of prokaryotic transposases from protein features with machine learning approaches

Microbial Genomics ◽

10.1099/mgen.0.000611 ◽

2021 ◽

Vol 7 (7) ◽

Author(s):

Qian Wang ◽

Jun Ye ◽

Teng Xu ◽

Ning Zhou ◽

Zhongqiu Lu ◽

...

Keyword(s):

Machine Learning ◽

Antibiotic Resistance ◽

Mutual Information ◽

Ensemble Classifier ◽

Training Dataset ◽

Command Line ◽

Learning Approaches ◽

Command Line Tool ◽

Selection Operator ◽

Insight Into

Identification of prokaryotic transposases (Tnps) gives insight not only into the spread of antibiotic resistance and virulence but also into the process of DNA movement. This study aimed to develop a classifier for predicting Tnps in bacteria and archaea using machine learning (ML) approaches. We extracted a total of 2751 protein features from the training dataset, which included 14852 Tnps and 14852 controls, and selected 75 features as predictive signatures using the combined mutual information and least absolute shrinkage and selection operator algorithms. By aggregating these signatures, an ensemble classifier that integrated a collection of individual ML-based classifiers was developed to identify Tnps. Further validation revealed that this classifier achieved good performance, with an average AUC of 0.955, and met or exceeded other common methods. Based on this ensemble classifier, a stand-alone command-line tool designated TnpDiscovery was established to maximize the convenience for bioinformaticians and experimental researchers toward Tnp prediction. This study demonstrates the effectiveness of ML approaches in identifying Tnps, facilitating the discovery of novel Tnps in the future.

Mining Pareto-Optimal Counterfactual Antecedents With A Branch-And-Bound Model-Agnostic Algorithm

10.21203/rs.3.rs-551661/v1 ◽

2021 ◽

Author(s):

Marcos M. Raimundo ◽

Luis Gustavo Nonato ◽

Jorge Poco

Keyword(s):

Machine Learning ◽

Linear Models ◽

Pareto Optimal ◽

Tree Search ◽

Original Sample ◽

Objective Functions ◽

Machine Learning Model ◽

Wide Range ◽

Speed Up ◽

Machine Learning Models

Abstract Mining counterfactual antecedents has become a valuable tool to discover knowledge and explain machine learning models. It consists of generating synthetic samples from an original sample to achieve the desired outcome in a machine learning model, thus helping to understand the prediction. An insightful methodology would explore a broader set of counterfactual antecedents to reveal multiple possibilities while operating on any classifier. Thus, we create a tree-based search that requires monotonicity from the objective functions (a.k.a. cost functions); it allows pruning branches that will not improve the objective functions. Since monotonicity is only required for the objective function, this method can be used for any family of classifiers (e.g., linear models, neural networks, decision trees). However, additional classifier properties speed up the tree search when it foresees branches that will not result in feasible actions. Moreover, the proposed optimization generates a diverse set of Pareto-optimal counterfactual antecedents by relying on multi-objective concepts. The results show an algorithm with working guarantees that enumerates a wide range of counterfactual antecedents. It helps the decision-maker understand the machine learning decision and find alternatives to achieve the desired outcome. The user can inspect these multiple counterfactual antecedents to find the most suitable one and gain a broader understanding of the prediction.
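
The pruning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the two objectives (number of changed features, total absolute change), the per-feature candidate values, and the names `dominates` and `pareto_counterfactuals` are all assumptions made for illustration. Because both objectives can only grow as more features are changed, any partial branch that is already matched or beaten by a known solution can be cut.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical objective vector: (features changed, total absolute change).
Objectives = Tuple[int, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b in every objective (weak dominance)."""
    return all(x <= y for x, y in zip(a, b))


def pareto_counterfactuals(
    sample: Sequence[float],
    candidates: Sequence[Sequence[float]],
    is_desired: Callable[[Sequence[float]], bool],
) -> List[Tuple[Objectives, List[float]]]:
    """Enumerate Pareto-optimal counterfactuals by depth-first tree search.

    candidates[i] lists the alternative values allowed for feature i.
    is_desired is any black-box classifier check (model-agnostic).
    """
    front: List[Tuple[Objectives, List[float]]] = []

    def search(i: int, current: List[float], obj: Objectives) -> None:
        # Prune: objectives are monotone, so no descendant can beat `found`.
        if any(dominates(found, obj) for found, _ in front):
            return
        if is_desired(current):
            # Drop solutions the new one dominates, then record it.
            front[:] = [(o, c) for o, c in front if not dominates(obj, o)]
            front.append((obj, list(current)))
            return  # further changes would only increase both objectives
        if i == len(sample):
            return
        search(i + 1, current, obj)  # branch 1: leave feature i unchanged
        for value in candidates[i]:  # branch 2..k: try each alternative value
            if value == sample[i]:
                continue
            current[i] = value
            cost = (obj[0] + 1, obj[1] + abs(value - sample[i]))
            search(i + 1, current, cost)
            current[i] = sample[i]

    search(0, list(sample), (0, 0.0))
    return sorted(front)


# Toy usage: the "classifier" accepts a sample when its features sum to >= 2.
result = pareto_counterfactuals(
    sample=[0.0, 0.0],
    candidates=[[1.0, 3.0], [1.0, 3.0]],
    is_desired=lambda x: x[0] + x[1] >= 2.0,
)
```

In this toy case the front holds two trade-offs: one large change to a single feature, or two small changes. Ties in the objective vector are resolved in favour of the first solution found.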

PyBDA: a command line tool for automated analysis of big biological data sets

BMC Bioinformatics ◽

10.1186/s12859-019-3087-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Simon Dirmeier ◽

Mario Emmenlauer ◽

Christoph Dehio ◽

Niko Beerenwinkel

Keyword(s):

Machine Learning ◽

High Performance ◽

Single Cells ◽

Automated Analysis ◽

Biological Data ◽

Machine Learning Algorithms ◽

Data Sets ◽

Command Line ◽

Command Line Tool ◽

High Performance Computing Cluster

Abstract Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. Results: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake in order to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. Conclusion: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.

nQuire: A Statistical Framework For Ploidy Estimation Using Next Generation Sequencing

10.1101/143537 ◽

2017 ◽

Cited By ~ 1

Author(s):

Clemens L. Weiß ◽

Marina Pais ◽

Liliana M. Cano ◽

Sophien Kamoun ◽

Hernán A. Burbano

Keyword(s):

Next Generation Sequencing ◽

Intraspecific Variation ◽

Ploidy Level ◽

Three Dimensions ◽

Command Line ◽

Next Generation ◽

Statistical Framework ◽

Wide Range ◽

Command Line Tool ◽

Generation Sequencing

Abstract Intraspecific variation in ploidy occurs in a wide range of species including pathogenic and nonpathogenic eukaryotes such as yeasts and oomycetes. Ploidy can be inferred indirectly - without measuring DNA content - from experiments using next-generation sequencing (NGS). We present nQuire, a statistical framework that distinguishes between diploids, triploids and tetraploids using NGS. The command-line tool models the distribution of base frequencies at variable sites using a Gaussian Mixture Model, and uses maximum likelihood to select the most plausible ploidy model. nQuire handles large genomes at high coverage efficiently and uses standard input file formats. We demonstrate the utility of nQuire analyzing individual samples of the pathogenic oomycete Phytophthora infestans and the Baker’s yeast Saccharomyces cerevisiae. Using these organisms we show the dependence between reliability of the ploidy assignment and sequencing depth. Additionally, we employ normalized maximized log-likelihoods generated by nQuire to ascertain ploidy level in a population of samples with ploidy heterogeneity. Using these normalized values we cluster samples in three dimensions using multivariate Gaussian mixtures. The cluster assignments retrieved from a S. cerevisiae population recovered the true ploidy level in over 96% of samples. Finally, we show that nQuire can be used regionally to identify chromosomal aneuploidies. nQuire provides a statistical framework to study organisms with intraspecific variation in ploidy. nQuire is likely to be useful in epidemiological studies of pathogens, artificial selection experiments, and for historical or ancient samples where intact nuclei are not preserved. It is implemented as a stand-alone Linux command line tool in the C programming language and is available at github.com/clwgg/nQuire under the MIT license.
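
The model-selection step described in this abstract can be sketched in a few lines. This is a simplified illustration, not nQuire's implementation (which is written in C): it assumes fixed component means at the expected allele-frequency peaks for each ploidy, equal mixture weights, and a fixed standard deviation `sigma`, whereas a real fit would estimate such parameters from the data. The names `PLOIDY_MEANS`, `log_likelihood`, and `most_plausible_ploidy` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Assumed expected base-frequency peaks at biallelic sites for each ploidy.
PLOIDY_MEANS: Dict[str, List[float]] = {
    "diploid": [0.5],
    "triploid": [1 / 3, 2 / 3],
    "tetraploid": [0.25, 0.5, 0.75],
}


def log_likelihood(freqs: Sequence[float], means: Sequence[float],
                   sigma: float = 0.05) -> float:
    """Log-likelihood of frequencies under an equal-weight Gaussian mixture."""
    norm = sigma * math.sqrt(2 * math.pi)
    total = 0.0
    for f in freqs:
        # Mixture density: average of the Gaussian components at f.
        density = sum(
            math.exp(-0.5 * ((f - m) / sigma) ** 2) / norm for m in means
        ) / len(means)
        total += math.log(density)
    return total


def most_plausible_ploidy(freqs: Sequence[float], sigma: float = 0.05) -> str:
    """Return the ploidy model with the highest log-likelihood."""
    scores = {
        name: log_likelihood(freqs, means, sigma)
        for name, means in PLOIDY_MEANS.items()
    }
    return max(scores, key=scores.get)


# Toy usage with made-up base frequencies clustered near 1/3 and 2/3.
call = most_plausible_ploidy([0.32, 0.34, 0.66, 0.68, 0.33, 0.67])
```

With frequencies restricted to the interval [0, 1] and `sigma` at 0.05, the mixture density stays well above floating-point underflow, so the logarithm is safe in this sketch; a much smaller `sigma` would need a log-sum-exp formulation.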

Efficient Prediction of Structural and Electronic Properties of Hybrid 2D Materials Using DFT and Machine Learning

10.26434/chemrxiv.6254756.v1 ◽

2018 ◽

Author(s):

Sherif Tawfik ◽

Olexandr Isayev ◽

Catherine Stampfl ◽

Joseph Shapter ◽

David Winkler ◽

...

Keyword(s):

Machine Learning ◽

Band Gap ◽

Density Functional ◽

2D Materials ◽

Van Der Waals ◽

Building Blocks ◽

Machine Learning Techniques ◽

Interlayer Distance ◽

Computational Screening ◽

Wide Range

Materials constructed from different van der Waals two-dimensional (2D) heterostructures offer a wide range of benefits, but these systems have been little studied because of their experimental and computational complexity, and because of the very large number of possible combinations of 2D building blocks. The simulation of the interface between two different 2D materials is computationally challenging due to the lattice mismatch problem, which sometimes necessitates the creation of very large simulation cells for performing density-functional theory (DFT) calculations. Here we use a combination of DFT, linear regression and machine learning techniques in order to rapidly determine the interlayer distance between two different 2D heterostructures that are stacked in a bilayer heterostructure, as well as the band gap of the bilayer. Our work provides an excellent proof of concept by quickly and accurately predicting a structural property (the interlayer distance) and an electronic property (the band gap) for a large number of hybrid 2D materials. This work paves the way for rapid computational screening of the vast parameter space of van der Waals heterostructures to identify new hybrid materials with useful and interesting properties.