Artificial Neural Network-Based Analysis of High-Throughput Screening Data for Improved Prediction of Active Compounds

2009 ◽  
Vol 14 (10) ◽  
pp. 1236-1244 ◽  
Author(s):  
Swapan Chakrabarti ◽  
Stan R. Svojanovsky ◽  
Romana Slavik ◽  
Gunda I. Georg ◽  
George S. Wilson ◽  
...  

Artificial neural networks (ANNs) are trained using high-throughput screening (HTS) data to recover active compounds from a large data set. Improved classification performance was obtained by combining predictions made by multiple ANNs. The HTS data, acquired from a methionine aminopeptidase inhibition study, consisted of a library of 43,347 compounds, and the ratio of active to nonactive compounds, R_A/N, was 0.0321. Back-propagation ANNs were trained and validated using principal components derived from the physicochemical features of the compounds. With carefully selected training parameters, an ANN recovered one-third of all active compounds from the validation set with a 3-fold gain in R_A/N. Further gains in R_A/N were obtained by combining the predictions made by a number of ANNs. The generalization property of back-propagation ANNs was exploited by training several ANNs on the same training samples after initializing them with different sets of random weights. As a result, only 10% of all available compounds were needed for training and validation, and the rest of the data set was screened with more than a 10-fold gain over the original R_A/N value. Thus, ANNs trained with limited HTS data may become useful in recovering active compounds from large data sets.
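The ensemble strategy in this abstract can be approximated in a short scikit-learn sketch: project the physicochemical descriptors onto principal components, train several back-propagation networks that differ only in their random initial weights, and average their predicted probabilities to rank compounds. The descriptor matrix, component count, and network size here are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: ensemble of back-propagation ANNs trained on principal
# components of physicochemical descriptors, as described in the abstract.
# Component count and network size are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def train_ann_ensemble(X_descriptors, y_active, n_components=10, n_nets=5):
    """Train several identically configured ANNs that differ only in their
    random weight initialization, then keep them as an ensemble."""
    pca = PCA(n_components=n_components)
    X_pc = pca.fit_transform(X_descriptors)

    ensemble = []
    for seed in range(n_nets):
        net = MLPClassifier(hidden_layer_sizes=(20,), solver="adam",
                            max_iter=500, random_state=seed)
        net.fit(X_pc, y_active)
        ensemble.append(net)
    return pca, ensemble

def rank_compounds(pca, ensemble, X_new):
    """Average the active-class probabilities over the ensemble and rank
    compounds so the top of the list is enriched in actives (higher R_A/N)."""
    X_pc = pca.transform(X_new)
    probs = np.mean([net.predict_proba(X_pc)[:, 1] for net in ensemble], axis=0)
    return np.argsort(probs)[::-1], probs
```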

Cells ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 931
Author(s):  
Justus Schikora ◽  
Nina Kiwatrowski ◽  
Nils Förster ◽  
Leonie Selbach ◽  
Friederike Ostendorf ◽  
...  

Neuronal models of neurodegenerative diseases such as Parkinson’s Disease (PD) are extensively studied in pathological and therapeutic research, with neurite outgrowth being a core feature. Screening of neurite outgrowth enables characterization of various stimuli and therapeutic effects after lesion. In this study, we describe an autonomous computational assay for a high-throughput skeletonization approach that allows quantification of neurite outgrowth in large data sets from fluorescence microscopic imaging. Development and validation of the assay were conducted with differentiated SH-SY5Y cells and primary mesencephalic dopaminergic neurons (MDN) treated with the neurotoxic lesioning compound rotenone. Manual annotation using NeuronJ and the automated measurements were shown to correlate strongly (R² = 0.9077 for SH-SY5Y cells and R² = 0.9297 for MDN). Pooled linear regressions of the SH-SY5Y cell image data could be summarized as y = 0.5410·x + 1792 (y = 0.8789·x + 0.09191 for normalized results), with y denoting the automated and x the manual data. This automated neurite length algorithm constitutes a valuable tool for modelling neurite outgrowth that can be easily applied to evaluate therapeutic compounds in high-throughput approaches.
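As a rough illustration of the validation step, the sketch below regresses automated neurite lengths against manual NeuronJ measurements and reports the slope, intercept, and R² of the fit; the numeric arrays are placeholders, not study data.

```python
# Hedged sketch: comparing automated neurite-length measurements against
# manual NeuronJ annotations by linear regression, analogous to the
# validation described in the abstract.
import numpy as np
from scipy.stats import linregress

def compare_neurite_lengths(manual, automated):
    """Fit automated = slope * manual + intercept and report R^2,
    mirroring the pooled regression reported in the abstract."""
    result = linregress(manual, automated)
    return result.slope, result.intercept, result.rvalue ** 2

# Example with synthetic placeholder values (not study data):
manual = np.array([120.0, 240.0, 310.0, 450.0, 600.0])      # NeuronJ (x)
automated = np.array([130.0, 250.0, 330.0, 470.0, 640.0])   # skeletonization (y)
slope, intercept, r2 = compare_neurite_lengths(manual, automated)
print(f"y = {slope:.4f}*x + {intercept:.4f}, R^2 = {r2:.4f}")
```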


2014 ◽  
pp. 215-223
Author(s):  
Dipak V. Patil ◽  
Rajankumar S. Bichkar

Advances in and widespread use of technology in all walks of life have resulted in tremendous growth of data available for data mining. The large amount of knowledge available can be used to improve decision-making. Such data typically contains some noise and outliers, which hampers the classification performance of classifiers built on that training data. Learning on large data sets also becomes very slow, as it must be performed serially over all available data. It has been shown that random data reduction techniques can be used to build optimal decision trees. Thus, data cleaning and data sampling techniques can be integrated to overcome the problems of handling large data sets. In the proposed technique, outlier data is first filtered out to obtain clean data of improved quality; a random sampling technique is then applied to this clean data to produce a reduced data set, which is used to construct an optimal decision tree. Experiments on several data sets showed that the proposed technique builds decision trees with better classification accuracy than those built on the complete data set. The classification filter improves data quality, and sampling reduces the size of the data set. The proposed method therefore constructs more accurate, optimally sized decision trees while avoiding problems such as overloading memory and processor with large data sets. In addition, the time required to build a model on the clean data is significantly reduced, providing a substantial speedup.
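A minimal sketch of the general pipeline is given below, assuming scikit-learn: a simple cross-validation misclassification filter stands in for the classification filter, followed by random sampling of the cleaned data and construction of the final decision tree. The filter choice and sample fraction are assumptions for illustration.

```python
# Hedged sketch: filter probable outliers with a classification filter,
# randomly sample the cleaned data, then build a decision tree. The filter
# and sample fraction are illustrative choices, not necessarily the authors'.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def clean_sample_and_train(X, y, sample_frac=0.3, random_state=0):
    # 1. Classification filter: drop instances misclassified under
    #    cross-validation, treating them as likely noise/outliers.
    base = DecisionTreeClassifier(random_state=random_state)
    y_pred = cross_val_predict(base, X, y, cv=5)
    keep = y_pred == y
    X_clean, y_clean = X[keep], y[keep]

    # 2. Random sampling of the cleaned data to reduce its size.
    rng = np.random.default_rng(random_state)
    n_sample = int(sample_frac * len(y_clean))
    idx = rng.choice(len(y_clean), size=n_sample, replace=False)

    # 3. Build the final decision tree on the reduced, cleaned data.
    tree = DecisionTreeClassifier(random_state=random_state)
    tree.fit(X_clean[idx], y_clean[idx])
    return tree
```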


2019 ◽  
Vol 1 (3) ◽  
pp. 42-48
Author(s):  
Mohammed Z. Al-Faiz ◽  
Ali A. Ibrahim ◽  
Sarmad M. Hadi

The speed of learning is considered one of the most important factors in neural network training, especially for large data sets. This paper aims to minimize the time required for a neural network to learn the data by standardizing the input data. The paper shows that Z-score standardization of the input data significantly decreased the number of epochs required for the network to learn. It also shows that a binary data set is a serious limitation for the convergence of a neural network, so standardization is essential in such cases, where zero-valued inputs effectively disable the corresponding connections in the network. The data set used in this paper consists of features extracted from gel electrophoresis images, which opens the door to using artificial intelligence in such areas.
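The preprocessing step credited with the speedup can be sketched as follows; the network size, iteration limit, and data shapes are illustrative assumptions.

```python
# Hedged sketch: z-score standardization of the input features before
# training, the step the abstract credits with reducing training epochs.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def train_with_zscore(X, y):
    # Z-score: subtract the per-feature mean and divide by the standard
    # deviation, so no feature is stuck at 0 and every input contributes.
    scaler = StandardScaler()
    X_std = scaler.fit_transform(X)

    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
    net.fit(X_std, y)
    print("epochs (iterations) to converge:", net.n_iter_)
    return scaler, net
```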


2002 ◽  
Vol 7 (4) ◽  
pp. 341-351 ◽  
Author(s):  
Michael F.M. Engels ◽  
Luc Wouters ◽  
Rudi Verbeeck ◽  
Greet Vanhoof

A data mining procedure for the rapid scoring of high-throughput screening (HTS) compounds is presented. The method is particularly useful for monitoring the quality of HTS data and tracking outliers in automated pharmaceutical or agrochemical screening, thus providing more complete and thorough structure-activity relationship (SAR) information. The method exploits the assumed relationship between the structure of the screened compounds and their biological activity on a given screen, expressed on a binary scale. By means of a data mining method, a SAR description of the data is developed that assigns to each compound of the screen a probability of being a hit. Then, an inconsistency score is computed that expresses the degree of deviation between the SAR description and the actual biological activity. The inconsistency score enables the identification of potential outliers that can be primed for validation experiments. The approach is particularly useful for detecting false-negative outliers and for identifying SAR-compliant hit/nonhit borderline compounds, both of which are classes of compounds that can contribute substantially to the development and understanding of robust SARs. In a first implementation of the method, one- and two-dimensional descriptors are used for encoding molecular structure information and logistic regression for calculating hit/nonhit probability scores. The approach was validated on three data sets: the first publicly available, the second and third from in-house HTS screening campaigns. Because of its simplicity, robustness, and accuracy, the procedure is suitable for automation.
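A minimal sketch of the first implementation described above, assuming scikit-learn: a logistic-regression SAR model assigns hit probabilities, and a simple inconsistency score measures the disagreement between the predicted probability and the observed binary outcome. The exact score used by the authors may differ from the absolute-difference version used here.

```python
# Hedged sketch: logistic-regression SAR model on molecular descriptors,
# with an inconsistency score per compound. The score here is simply
# |observed label - predicted hit probability|, an assumption for
# illustration rather than the paper's exact definition.
import numpy as np
from sklearn.linear_model import LogisticRegression

def inconsistency_scores(X_descriptors, y_hits):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_descriptors, y_hits)
    p_hit = model.predict_proba(X_descriptors)[:, 1]

    # Large score: observed label deviates strongly from the SAR model,
    # e.g. a measured nonhit with a high predicted probability is a
    # candidate false negative worth re-testing.
    scores = np.abs(y_hits - p_hit)
    candidate_outliers = np.argsort(scores)[::-1]
    return scores, candidate_outliers
```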


Molecules ◽  
2019 ◽  
Vol 24 (23) ◽  
pp. 4292 ◽  
Author(s):  
Daniel Midkiff ◽  
Adriana San-Miguel

The nematode Caenorhabditis elegans is a powerful model organism that has been widely used to study molecular biology, cell development, neurobiology, and aging. Despite their use for the past several decades, the conventional techniques for growth, imaging, and behavioral analysis of C. elegans can be cumbersome, and acquiring large data sets in a high-throughput manner can be challenging. Developments in microfluidic “lab-on-a-chip” technologies have improved studies of C. elegans by increasing experimental control and throughput. Microfluidic features such as on-chip control layers, immobilization channels, and chamber arrays have been incorporated to develop increasingly complex platforms that make experimental techniques more powerful. Genetic and chemical screens are performed on C. elegans to determine gene function and phenotypic outcomes of perturbations, to test the effect that chemicals have on health and behavior, and to find drug candidates. In this review, we will discuss microfluidic technologies that have been used to increase the throughput of genetic and chemical screens in C. elegans. We will discuss screens for neurobiology, aging, development, behavior, and many other biological processes. We will also discuss robotic technologies that assist in microfluidic screens, as well as alternate platforms that perform functions similar to microfluidics.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Samuel Goodwin ◽  
Golnaz Shahtahmassebi ◽  
Quentin S. Hanley

High throughput screening (HTS) interrogates compound libraries to find those that are “active” in an assay. To better understand compound behavior in HTS, we assessed an existing binomial survivor function (BSF) model of “frequent hitters” using 872 publicly available HTS data sets. We found large numbers of “infrequent hitters” using this model, leading us to reject the BSF for identifying “frequent hitters.” As alternatives, we investigated generalized logistic, gamma, and negative binomial distributions as models for compound behavior. The gamma model reduced the proportion of both frequent and infrequent hitters relative to the BSF. Within this data set, conclusions about individual compound behavior were limited by the number of times individual compounds were tested (1–1613 times) and disproportionate testing of some compounds. Specifically, most tests (78%) were on a 309,847-compound subset (17.6% of compounds), each tested ≥ 300 times. We concluded that the disproportionate retesting of some compounds represents compound repurposing at scale rather than drug discovery. The approach to drug discovery represented by these 872 data sets characterizes the assays well by challenging them with many compounds, while each compound is characterized poorly with a single assay. Aggregating the testing information from each compound across the multiple screens yielded a continuum with no clear boundary between normal and frequent hitting compounds.
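The binomial survivor function that the paper evaluates (and ultimately rejects in favor of gamma and negative binomial models) can be computed directly with SciPy; the test counts, hit counts, and background rate below are illustrative.

```python
# Hedged sketch: the binomial survivor function (BSF) view of "frequent
# hitters": the probability of seeing at least k hits in n assays if a
# compound hits at the library-wide background rate p. Inputs are
# illustrative; the paper ultimately prefers gamma and negative binomial
# models over this one.
from scipy.stats import binom

def bsf_score(n_tests, n_hits, background_hit_rate):
    """P(X >= n_hits) for X ~ Binomial(n_tests, background_hit_rate)."""
    return binom.sf(n_hits - 1, n_tests, background_hit_rate)

# A compound tested 300 times with 30 hits, against a 2% background rate:
print(bsf_score(300, 30, 0.02))  # very small -> flagged as a frequent hitter
```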


2009 ◽  
Vol 15 (4) ◽  
pp. 670-681 ◽  
Author(s):  
David Mayerich ◽  
John Keyser

We present a framework for segmenting and storing filament networks from scalar volume data. Filament networks are encountered more and more commonly in biomedical imaging due to advances in high-throughput microscopy. These data sets are characterized by a complex volumetric network of thin filaments embedded in a scalar volume field. High-throughput microscopy volumes are also difficult to manage since they can require several terabytes of storage, even though the total volume of the embedded structure is much smaller. Filaments in microscopy data sets are difficult to segment because their diameter is often near the sampling resolution of the microscope, yet these networks can span large regions of the data set. We describe a novel method to trace filaments through scalar volume data sets that is robust to both noisy and undersampled data. We use graphics hardware to accelerate the tracing algorithm, making it more useful for large data sets. After the initial network is traced, we use an efficient encoding scheme to store volumetric data pertaining to the network.
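The storage idea, though not the tracing algorithm itself, can be illustrated with a generic sparse encoding: keep only the coordinates and scalar values of voxels that belong to the traced network, since they occupy a tiny fraction of the full volume.

```python
# Hedged sketch of the storage idea only: because the filament network
# occupies a tiny fraction of the volume, keeping just the coordinates and
# intensities of network voxels is far cheaper than storing the full grid.
# This is a generic sparse encoding, not the authors' tracing algorithm.
import numpy as np

def encode_network_voxels(volume, mask):
    """volume: 3-D scalar field; mask: boolean array marking traced voxels."""
    coords = np.argwhere(mask)                 # (N, 3) voxel coordinates
    values = volume[mask].astype(np.float32)   # scalar data along the network
    return coords.astype(np.uint32), values

def decode_network_voxels(coords, values, shape):
    """Rebuild a (mostly empty) volume containing only the network."""
    out = np.zeros(shape, dtype=np.float32)
    out[tuple(coords.T)] = values
    return out
```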


2013 ◽  
Vol 18 (9) ◽  
pp. 1121-1131 ◽  
Author(s):  
Xin Wei ◽  
Lin Gao ◽  
Xiaolei Zhang ◽  
Hong Qian ◽  
Karen Rowan ◽  
...  

High-throughput screening (HTS) has been widely used to identify active compounds (hits) that bind to biological targets. Because of cost concerns, the comprehensive screening of millions of compounds is typically conducted without replication. Real hits that fail to exhibit measurable activity in the primary screen due to random experimental errors will be lost as false negatives. Conceivably, the projected false-negative rate is a parameter that reflects screening quality. Furthermore, it can be used to guide the selection of optimal numbers of compounds for hit confirmation. Therefore, a method that predicts false-negative rates from the primary screening data is extremely valuable. In this article, we describe the implementation of a pilot screen on a representative fraction (1%) of the screening library in order to obtain information about assay variability as well as a preliminary hit activity distribution profile. Using this training data set, we then developed an algorithm based on Bayesian logic and Monte Carlo simulation to estimate the number of true active compounds and potential missed hits in the full library screen. We have applied this strategy to five screening projects. The results demonstrate that this method produces useful predictions of the number of false negatives.
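A simplified Monte Carlo sketch of the false-negative estimate is shown below; it assumes the pilot screen supplies an estimate of the assay noise and a sample of true activities, and it omits the Bayesian machinery of the actual method.

```python
# Hedged sketch: Monte Carlo estimation of the false-negative rate from
# pilot-screen information (assay noise sigma and a sample of true
# activities). This is a simplification, not the authors' full algorithm.
import numpy as np

def estimate_false_negative_rate(true_activities, sigma, hit_threshold,
                                 n_sim=10000, seed=0):
    rng = np.random.default_rng(seed)
    actives = true_activities[true_activities >= hit_threshold]
    if actives.size == 0:
        return 0.0

    missed = 0
    for _ in range(n_sim):
        # Add random experimental error to each true active and count how
        # many fall below the hit threshold in the simulated primary screen.
        measured = actives + rng.normal(0.0, sigma, size=actives.size)
        missed += np.sum(measured < hit_threshold)
    return missed / (n_sim * actives.size)
```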


2019 ◽  
Author(s):  
Hui Kwon Kim ◽  
Younggwang Kim ◽  
Sungtae Lee ◽  
Seonwoo Min ◽  
Jung Yoon Bae ◽  
...  

We evaluated SpCas9 activities at 12,832 target sequences using a high-throughput approach based on a human cell library containing sgRNA-encoding and target sequence pairs. Deep learning-based training on this large data set of SpCas9-induced indel frequencies led to the development of an SpCas9 activity-predicting model named DeepSpCas9. When tested against independently generated data sets (our own and those published by other groups), DeepSpCas9 showed unprecedentedly high generalization performance. DeepSpCas9 is available at http://deepcrispr.info/DeepCas9.


2020 ◽  
Vol 39 (5) ◽  
pp. 6419-6430
Author(s):  
Dusan Marcek

To forecast time series data, two methodological frameworks of statistical and computational intelligence modelling are considered. The statistical approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with the Maximum Likelihood (ML) estimation method. As a competing tool to the statistical forecasting models, we use the classic perceptron-type neural network (NN). To train the NN, the Back-Propagation (BP) algorithm and heuristics such as the genetic and micro-genetic algorithms (GA and MGA) are implemented on a large data set. A comparative analysis of the selected learning methods is performed and evaluated. The experiments indicate that a population size of 20 is likely optimal, giving the lowest training time of all NNs trained by the evolutionary algorithms, while the prediction accuracy is somewhat lower but still acceptable to managers.
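As an illustration of the evolutionary training compared in the paper, the sketch below evolves the weights of a small perceptron-type forecaster with a simple genetic algorithm; the population size of 20 follows the abstract's finding, while the lag order, hidden-layer size, and mutation scale are assumptions.

```python
# Hedged sketch: training a small perceptron-type forecaster with a genetic
# algorithm instead of back-propagation. Population size 20 follows the
# abstract; lag order, hidden size, and mutation scale are assumptions.
import numpy as np

def make_lagged(series, n_lags=4):
    """series: 1-D numpy array; returns lagged inputs X and targets y."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

def forecast(weights, X, n_lags=4, n_hidden=5):
    # Unpack a flat weight vector into a one-hidden-layer network.
    W1 = weights[:n_lags * n_hidden].reshape(n_lags, n_hidden)
    b1 = weights[n_lags * n_hidden:n_lags * n_hidden + n_hidden]
    W2 = weights[-(n_hidden + 1):-1]
    b2 = weights[-1]
    h = np.tanh(X @ W1 + b1)
    return h @ W2 + b2

def ga_train(X, y, n_lags=4, n_hidden=5, pop_size=20, n_gen=200, seed=0):
    rng = np.random.default_rng(seed)
    n_weights = n_lags * n_hidden + n_hidden + n_hidden + 1
    pop = rng.normal(0, 0.5, size=(pop_size, n_weights))
    for _ in range(n_gen):
        mse = np.array([np.mean((forecast(w, X, n_lags, n_hidden) - y) ** 2)
                        for w in pop])
        parents = pop[np.argsort(mse)[:pop_size // 2]]           # selection
        children = parents + rng.normal(0, 0.1, parents.shape)   # mutation
        pop = np.vstack([parents, children])
    mse = np.array([np.mean((forecast(w, X, n_lags, n_hidden) - y) ** 2)
                    for w in pop])
    return pop[np.argmin(mse)]
```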

