Lessons Learned from High Throughput Screening of >250,000 Cells: New Cell Evaluation Methods and Data Mining Techniques

2019 ◽  
Vol 123 (23) ◽  
pp. 14610-14618 ◽  
Author(s):  
Mohammad Atif Faiz Afzal ◽  
Mojtaba Haghighatlari ◽  
Sai Prasad Ganesh ◽  
Chong Cheng ◽  
Johannes Hachmann

We present a high-throughput computational study to identify novel polyimides (PIs) with exceptional refractive index (RI) values for use as optical or optoelectronic materials. Our study utilizes an RI prediction protocol based on a combination of first-principles and data modeling developed in previous work, which we employ on a large-scale PI candidate library generated with the ChemLG code. We deploy the virtual screening software ChemHTPS to automate the assessment of this extensive pool of PI structures and determine the performance potential of each candidate. This rapid and efficient approach yields a number of highly promising lead compounds. Using the data mining and machine learning program package ChemML, we analyze the top candidates with respect to prevalent structural features and feature combinations that distinguish them from less promising ones. In particular, we explore the utility of various strategies that introduce highly polarizable moieties into the PI backbone to increase its RI value. The derived insights provide a foundation for rational and targeted design that goes beyond traditional trial-and-error searches.
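The abstract does not spell out the RI protocol's functional form, but intrinsic refractive index models of this kind are commonly built on the Lorentz-Lorenz relation, which connects n to the molecular polarizability and the number density of polarizable units. A minimal sketch under that assumption (the function name and all numbers are illustrative, not taken from the paper):

```python
import math

def lorentz_lorenz_n(polarizability_cm3, number_density_cm3):
    """Refractive index n from the Lorentz-Lorenz relation:
        (n^2 - 1) / (n^2 + 2) = (4*pi/3) * N * alpha
    where alpha is the polarizability (cm^3) and N the number
    density of polarizable units (cm^-3)."""
    L = (4.0 * math.pi / 3.0) * number_density_cm3 * polarizability_cm3
    if not 0.0 <= L < 1.0:
        raise ValueError("unphysical Lorentz-Lorenz factor")
    # Solve the relation for n.
    return math.sqrt((1.0 + 2.0 * L) / (1.0 - L))
```

The relation makes the design intuition explicit: at fixed density, more polarizable moieties in the backbone raise the Lorentz-Lorenz factor and hence the predicted RI.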


Author(s):  
Shadi Aljawarneh ◽  
Aurea Anguera ◽  
John William Atwood ◽  
Juan A. Lara ◽  
David Lizcano

Nowadays, large amounts of data are generated in the medical domain. Various physiological signals generated by different organs can be recorded to extract interesting information about patients' health. The analysis of physiological signals is a hard task that requires the use of specific approaches such as the Knowledge Discovery in Databases (KDD) process. Applying such a process in the domain of medicine entails a series of implications and difficulties, especially regarding the application of data mining techniques to data, mainly time series, gathered from medical examinations of patients. The goal of this paper is to describe the lessons learned and the experience gathered by the authors while applying data mining techniques to real patient data, including time series. In this research, we carried out an exhaustive case study working on data from two medical fields: stabilometry (15 professional basketball players, 18 elite ice skaters) and electroencephalography (100 healthy patients, 100 epileptic patients). We applied a previously proposed knowledge discovery framework for classification purposes, obtaining good results in terms of classification accuracy (greater than 99% in both fields). These results are the groundwork for the lessons learned and recommendations made in this position paper, which is intended as a guide for experts who face similar medical data mining projects.


2004 ◽  
Vol 47 (25) ◽  
pp. 6373-6383 ◽  
Author(s):  
David J. Diller ◽  
Doug W. Hobbs

2014 ◽  
Vol 19 (5) ◽  
pp. 707-714 ◽  
Author(s):  
Christine Clougherty Genick ◽  
Danielle Barlier ◽  
Dominique Monna ◽  
Reto Brunner ◽  
Céline Bé ◽  
...  

For approximately a decade, biophysical methods have been used to validate positive hits selected from high-throughput screening (HTS) campaigns, with the goal of verifying binding interactions using label-free assays. By applying label-free readouts, screen artifacts created by compound interference and fluorescence are discovered, enabling further characterization of the hits for their target specificity and selectivity. The use of several biophysical methods to extract this type of high-content information is required to prevent the promotion of false positives to the next level of hit validation and to select the best candidates for further chemical optimization. The typical technologies applied in this arena include dynamic light scattering, turbidimetry, resonant waveguide grating, surface plasmon resonance, differential scanning fluorimetry, mass spectrometry, and others. Each technology can provide a different type of information to enable the characterization of the binding interaction. Thus, these technologies can be incorporated into a hit-validation strategy not only according to the profile of chemical matter desired by the medicinal chemists, but also in a manner that is in agreement with the target protein's amenability to the screening format. Here, we present the results of screening strategies using biophysics, with the objective of evaluating the approaches, discussing the advantages and challenges, and summarizing the benefits with reference to lead discovery. In summary, the biophysics screens presented here demonstrated various hit rates from a list of ~2000 preselected, IC50-validated hits from HTS (an IC50 is the inhibitor concentration at which 50% inhibition of activity is observed). There are several lessons learned from these biophysical screens, which are discussed in this article.
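As a side note to the IC50 definition above: the dose-response relationship behind such a value is often modeled with a simple Hill (logistic) curve, under which the IC50 is the concentration giving exactly 50% inhibition. A minimal illustrative sketch, not the assay model used in the screens described here:

```python
def percent_inhibition(conc, ic50, hill=1.0):
    """Percent inhibition under a simple Hill dose-response model:
        inhibition = 100 * conc^h / (ic50^h + conc^h)
    At conc == ic50 this returns exactly 50%, matching the
    definition of an IC50."""
    return 100.0 * conc**hill / (ic50**hill + conc**hill)
```

In practice an IC50 is obtained the other way around, by fitting this kind of curve to measured inhibition values across a concentration series and reading off the midpoint.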



Author(s):  
Hoda Ahmed Abdelhafez

Mining big data is currently getting a lot of attention because businesses need more complex information in order to increase their revenue and gain competitive advantage. Therefore, mining huge amounts of data, as well as mining real-time data, needs to be done with new data mining techniques/approaches. This chapter will discuss big data volume, variety, and velocity; data mining techniques; and open source tools for handling very large datasets. Moreover, the chapter will focus on two industrial areas, telecommunications and healthcare, and the lessons learned from them.



2002 ◽  
Vol 7 (4) ◽  
pp. 341-351 ◽  
Author(s):  
Michael F.M. Engels ◽  
Luc Wouters ◽  
Rudi Verbeeck ◽  
Greet Vanhoof

A data mining procedure for the rapid scoring of high-throughput screening (HTS) compounds is presented. The method is particularly useful for monitoring the quality of HTS data and tracking outliers in automated pharmaceutical or agrochemical screening, thus providing more complete and thorough structure-activity relationship (SAR) information. The method is based on the assumed relationship between the structure of the screened compounds and the biological activity on a given screen, expressed on a binary scale. By means of a data mining method, a SAR description of the data is developed that assigns each compound of the screen a probability of being a hit. Then, an inconsistency score is computed that expresses the degree of deviation between the SAR description and the actual biological activity. The inconsistency score enables the identification of potential outliers that can be primed for validation experiments. The approach is particularly useful for detecting false-negative outliers and for identifying SAR-compliant hit/nonhit borderline compounds, both of which are classes of compounds that can contribute substantially to the development and understanding of robust SARs. In a first implementation of the method, one- and two-dimensional descriptors are used to encode molecular structure information, and logistic regression is used to calculate hit/nonhit probability scores. The approach was validated on three data sets, the first from a publicly available screening data set and the second and third from in-house HTS screening campaigns. Because of its simplicity, robustness, and accuracy, the procedure is suitable for automation.
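The pipeline described above, logistic regression on molecular descriptors to obtain hit probabilities, followed by an inconsistency score against the observed binary activity, can be sketched as follows. Everything here (the synthetic binary fingerprints, the planted label flip, and cross-entropy as the inconsistency measure) is an illustrative assumption, not the authors' exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for screening data: 200 compounds described by
# 8 binary structural descriptors, with hit/nonhit labels generated
# from a hidden structure-activity relationship.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8)).astype(float)
true_w = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w - 0.5)))
y = (rng.random(200) < p_true).astype(int)
y[0] = 1 - y[0]  # plant one label flip to mimic a screening artifact

# Fit a SAR description: probability of being a hit per compound.
model = LogisticRegression().fit(X, y)
p_hit = model.predict_proba(X)[:, 1]

# Inconsistency score: deviation between the SAR-predicted probability
# and the recorded binary outcome (cross-entropy per compound).
eps = 1e-12
inconsistency = -(y * np.log(p_hit + eps) + (1 - y) * np.log(1 - p_hit + eps))

# Compounds with the largest scores are candidates for re-assay,
# e.g. potential false negatives.
suspects = np.argsort(inconsistency)[::-1][:5]
```

Ranking by this score surfaces compounds whose recorded activity disagrees most with what their structure predicts, which is exactly the outlier-priming step the abstract describes.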


2005 ◽  
Vol 10 (7) ◽  
pp. 653-657 ◽  
Author(s):  
Nadine H. Elowe ◽  
Jan E. Blanchard ◽  
Jonathan D. Cechetto ◽  
Eric D. Brown

High-throughput screening (HTS) generates an abundance of data that are a valuable resource to be mined. Dockers and data miners can use “real-world” HTS data to test and further develop their tools. A screen of 50,000 diverse small molecules was carried out against Escherichia coli dihydrofolate reductase (DHFR) and compared with a previous screen of 50,000 compounds against the same target. Identical assays and conditions were maintained for both studies. Prior to the completion of the second screen, the original screening data were publicly released for use as a “training set,” and computational chemists and data analysts were challenged to predict the activity of compounds in this second “test set.” Upon completion, the primary screen of the test set generated no potent inhibitors of DHFR activity.


Author(s):  
Kamal Azzaoui ◽  
John P. Priestle ◽  
Thibault Varin ◽  
Ansgar Schuffenhauer ◽  
Jeremy L. Jenkins ◽  
...  
