A visual approach for analysis and inference of molecular activity spaces

2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Samina Kausar ◽  
Andre O. Falcao

Abstract
Background: Molecular space visualization can help to explore the diversity of large heterogeneous chemical data, which ultimately may increase the understanding of structure-activity relationships (SAR) in drug discovery projects. Visual SAR analysis can therefore be useful for library design, chemical classification for biological evaluation, and virtual screening for the selection of compounds for synthesis or in vitro testing. As such, computational approaches for molecular space visualization have become an important issue in cheminformatics research. The proposed approach uses molecular similarity as the sole input for computing a probabilistic surface of molecular activity (PSMA). This similarity matrix is transformed into 2D using different dimension reduction algorithms (Principal Coordinates Analysis (PCooA), Kruskal multidimensional scaling, Sammon mapping and t-SNE). From this projection, a kernel density function is applied to compute the probability of activity for each coordinate in the new projected space.
Results: This methodology was tested over four different quantitative structure-activity relationship (QSAR) binary classification data sets and a PSMA was computed for each. The generated maps showed internal consistency, with active molecules grouped together for all data sets and all dimensionality reduction algorithms. To validate the quality of the generated maps, the 2D coordinates of test molecules were computed into the new reference space using a data transformation matrix. In total sixteen PSMAs were built, and their performance was assessed using the Area Under the Curve (AUC) and the Matthews Correlation Coefficient (MCC). For the best projections for each data set, AUC testing results ranged from 0.87 to 0.98 and MCC scores ranged from 0.33 to 0.77, suggesting this methodology can validly capture the complexities of the molecular activity space. All four mapping functions provided generally good results, yet the overall performance of PCooA and t-SNE was slightly better than that of Sammon mapping and Kruskal multidimensional scaling.
Conclusions: Our results showed that, by using an appropriate combination of metric space representation and dimensionality reduction, it is possible to produce a visual PSMA whose consistency has been validated by using the map as a classification model. The produced maps can be used as prediction tools, as it is simple to project any molecule into the new reference space as long as its similarities to the molecules used to compute the initial similarity matrix can be computed.
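As an illustration of the pipeline described above, the following sketch builds a toy PSMA: it derives Tanimoto distances from randomly generated fingerprints (an assumption for self-containment; the paper works from precomputed molecular similarities), projects them to 2D with metric MDS standing in for PCooA/Kruskal/Sammon/t-SNE, and estimates a probability-of-activity surface with kernel density estimates. It is a minimal sketch, not the authors' implementation.

```python
# Minimal PSMA-style sketch (not the authors' code): similarity matrix ->
# 2D projection -> kernel density estimate of activity probability.
import numpy as np
from sklearn.manifold import MDS
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
fp = rng.integers(0, 2, size=(200, 128))             # toy binary fingerprints (assumed)
active = rng.integers(0, 2, size=200).astype(bool)   # toy activity labels (assumed)

# Tanimoto similarity and the corresponding distance matrix
inter = fp @ fp.T
union = fp.sum(1)[:, None] + fp.sum(1)[None, :] - inter
sim = inter / np.maximum(union, 1)
dist = 1.0 - sim

# 2D embedding from the precomputed distances (metric MDS stands in for the
# four projection methods used in the paper)
xy = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)

# Kernel density estimates over the projected actives and over all molecules
kde_act = KernelDensity(bandwidth=0.5).fit(xy[active])
kde_all = KernelDensity(bandwidth=0.5).fit(xy)

def p_active(points):
    """Rough probability-of-activity surface: P(x|active) * P(active) / P(x)."""
    num = np.exp(kde_act.score_samples(points)) * active.mean()
    den = np.exp(kde_all.score_samples(points))
    return num / np.maximum(den, 1e-12)

print(p_active(xy[:5]))
```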

Author(s):  
Danlei Xu ◽  
Lan Du ◽  
Hongwei Liu ◽  
Penghui Wang

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, where a set of nonlinear mappings of the original data is applied as a pre-processing step. A linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct a nonlinear classification boundary, but also perform feature selection on the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of the Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings, respectively. We derive the Variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results based on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability and comparable classification accuracy of our method compared with several other existing classifiers.
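The following snippet is only a rough stand-in for the described model, not the authors' VB algorithm: an RBF random-feature map plays the role of the nonlinear mappings, and an L1 penalty approximates the effect of the sparsity-promoting priors. Data set and hyperparameters are illustrative assumptions.

```python
# Hedged stand-in for "nonlinear mapping + sparse linear classifier".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

model = make_pipeline(
    RBFSampler(gamma=0.1, n_components=100, random_state=0),       # nonlinear mapping
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),   # sparsity-promoting classifier
)
print(cross_val_score(model, X, y, cv=5).mean())
```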


2003 ◽  
Vol 31 (4) ◽  
pp. 393-399
Author(s):  
Herbert S. Rosenkranz

The increased acceptance of the use of structure–activity relationship (SAR) approaches to toxicity modelling has necessitated an evaluation of the limitations of the methodology. In this study, the limit of the capacity of the MULTICASE SAR program to model complex biological and toxicological phenomena was assessed. It was estimated that, provided the data set consists of at least 300 chemicals, divided equally between active and inactive compounds, the program is capable of handling phenomena that are even more “complex” than those modelled up to now (for example, allergic contact dermatitis, Salmonella mutagenicity, biodegradability, inhibition of tubulin polymerisation). However, within the data sets currently used to generate SAR models, there are limits to the complexity that can be handled. This may be the situation with regard to the modelling of systemic toxicity (for example, the LD50).


2018 ◽  
Vol 14 (4) ◽  
pp. 20-37 ◽  
Author(s):  
Yinglei Song ◽  
Yongzhong Li ◽  
Junfeng Qu

This article develops a new approach for supervised dimensionality reduction. This approach considers both global and local structures of a labelled data set and maximizes a new objective that includes the effects from both of them. The objective can be approximately optimized by solving an eigenvalue problem. The approach is evaluated based on a few benchmark data sets and image databases. Its performance is also compared with a few other existing approaches for dimensionality reduction. Testing results show that, on average, this new approach can achieve more accurate results for dimensionality reduction than existing approaches.
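The general recipe of combining a global and a local scatter term and solving an eigenvalue problem can be sketched as follows; the specific objective and the weighting alpha are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a supervised projection from a generalized eigenvalue problem that
# trades off global between-class scatter against local within-class scatter.
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X, y = load_iris(return_X_y=True)

# Global structure: between-class scatter
Sb = np.zeros((X.shape[1], X.shape[1]))
for c in np.unique(y):
    d = (X[y == c].mean(0) - X.mean(0))[:, None]
    Sb += (y == c).sum() * d @ d.T

# Local structure: within-class scatter restricted to k nearest neighbours
Sw = np.zeros_like(Sb)
_, idx = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)
for i, neigh in enumerate(idx):
    for j in neigh[1:]:
        if y[i] == y[j]:
            d = (X[i] - X[j])[:, None]
            Sw += d @ d.T

alpha = 1.0  # relative weight of the regularizer (assumed)
vals, vecs = eigh(Sb, Sw + alpha * np.eye(Sw.shape[0]))
W = vecs[:, np.argsort(vals)[::-1][:2]]   # top-2 projection directions
print((X @ W).shape)                      # data projected to 2D
```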


1988 ◽  
Vol 32 (17) ◽  
pp. 1183-1187
Author(s):  
J. G. Kreifeldt ◽  
S. H. Levine ◽  
M. C. Chuang

Sensory modalities exhibit a characteristic known as Weber's ratio, which states that when two stimuli are compared for a difference: (1) there is some minimal nonzero difference which can be differentiated, and (2) this minimal difference is a nearly constant proportion of the magnitude of the stimuli. Both of these would, in a typical measurement context, appear to be system defects. We have found through simulation explorations that these are in fact apparently the characteristics required by a system designed to extract an adequate amount of information from an incomplete observation data set according to a new approach to measurement.
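A toy simulation in the spirit of this observation (not the authors' model) makes the two properties concrete: a difference is detected only when it exceeds a fixed Weber fraction of the stimulus magnitude, so the detectable difference is nonzero and scales with the stimulus.

```python
# Toy Weber-fraction discrimination simulation; the fraction k is assumed.
import numpy as np

rng = np.random.default_rng(1)
k = 0.05                                   # assumed Weber fraction
base = rng.uniform(1, 100, 10_000)         # stimulus magnitudes
delta = rng.uniform(0, 10, 10_000)         # differences between stimulus pairs

detected = delta >= k * base               # minimal detectable difference scales with magnitude
print(f"proportion of pairs discriminated: {detected.mean():.2f}")
```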


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analysing high-dimensional data sets. The raw input data set may have many dimensions, and analysis might be time-consuming and lead to wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data and work towards accurate prediction at lower cost. In this paper, the different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, Kernel Principal Component Analysis and Artificial Neural Networks, have been studied.
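As a quick illustration of how such techniques are typically compared (an assumed setup with scikit-learn, not taken from the paper), the snippet below applies several of the surveyed reducers to one benchmark data set and checks downstream classification accuracy.

```python
# Compare several dimensionality reduction techniques on one data set.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
reducers = {
    "PCA": PCA(n_components=9),
    "SVD": TruncatedSVD(n_components=9),
    "LDA": LinearDiscriminantAnalysis(n_components=9),
    "KernelPCA": KernelPCA(n_components=9, kernel="rbf"),
}
for name, reducer in reducers.items():
    clf = make_pipeline(reducer, LogisticRegression(max_iter=2000))
    print(name, round(cross_val_score(clf, X, y, cv=3).mean(), 3))
```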


Symmetry ◽  
2019 ◽  
Vol 11 (8) ◽  
pp. 1036 ◽  
Author(s):  
Yun-Long Gao ◽  
Si-Zhe Luo ◽  
Zhi-Hao Wang ◽  
Chih-Cheng Chen ◽  
Jin-Yan Pan

Graph-based embedding methods receive much attention due to their use of graph and manifold information. However, conventional graph-based embedding methods may not always be effective if the data have high dimensionality and complex distributions. First, the similarity matrix only considers local distance measurements in the original space, which cannot reflect a wide variety of data structures. Second, the separation of graph construction and dimensionality reduction means the similarity matrix cannot be fully relied on, because the original data usually contain many noisy samples and features. In this paper, we address these problems by constructing two adjacency graphs to represent the similarity and diversity structure of the original data, and then imposing a rank constraint on the corresponding Laplacian matrix to build a novel adaptive graph learning method, namely locality sensitive discriminative unsupervised dimensionality reduction (LSDUDR). As a result, the learned graph shows a clear block diagonal structure so that the clustering structure of the data is preserved. Experimental results on synthetic data sets and real-world benchmark data sets demonstrate the effectiveness of our approach.
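The basic ingredients, a neighbourhood adjacency graph and its Laplacian, can be sketched as below. This is a simplified illustration using a fixed k-NN graph and a spectral embedding, not the adaptive, rank-constrained graph learning of LSDUDR.

```python
# k-NN adjacency graph, normalized Laplacian, and a 2D spectral embedding.
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = 0.5 * (W + W.T)                       # symmetrize the adjacency graph
L = laplacian(W, normed=True).toarray()   # normalized graph Laplacian

vals, vecs = eigh(L)
embedding = vecs[:, 1:3]                  # smallest nontrivial eigenvectors give a 2D embedding
print(embedding.shape)
```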


Author(s):  
Piyasak Jeatrakul ◽  
◽  
Kok Wai Wong ◽  
Chun Che Fung

In many classification problems, data cleaning is used as a preprocessing technique in order to achieve better results. The purpose of data cleaning is to remove noise, inconsistent data and errors from the training data. This should enable the use of a better and more representative data set to develop a reliable classification model. In most classification models, unclean data can affect the classification accuracy of the model. In this paper, we investigate the use of misclassification analysis for data cleaning. In order to demonstrate our concept, we use an Artificial Neural Network (ANN) as the core computational intelligence technique. We use four benchmark data sets obtained from the University of California Irvine (UCI) machine learning repository to investigate the results of our proposed data cleaning technique. The data sets used in our experiments are binary classification problems: German credit data, BUPA liver disorders, Johns Hopkins Ionosphere and Pima Indians Diabetes. The results show that the proposed cleaning technique can be a good alternative to provide some confidence when constructing a classification model.
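A hedged sketch of the general idea (not the authors' exact protocol or UCI data sets) is shown below: train an ANN, drop the training instances it misclassifies, retrain on the cleaned set, and compare test accuracy. The data set and network size are illustrative assumptions.

```python
# Misclassification-analysis data cleaning, illustrated with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ann = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0).fit(X_tr, y_tr)
keep = ann.predict(X_tr) == y_tr          # misclassification analysis on the training data

cleaned = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
cleaned.fit(X_tr[keep], y_tr[keep])       # retrain on the cleaned training set

print("before cleaning:", ann.score(X_te, y_te))
print("after cleaning: ", cleaned.score(X_te, y_te))
```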


2020 ◽  
Author(s):  
Michael Allen ◽  
Andrew Salmon

Abstract
Background: Open science is a movement seeking to make scientific research accessible to all, including publication of code and data. Publishing patient-level data may, however, compromise the confidentiality of that data if there is any significant risk that the data may later be associated with individuals. Use of synthetic data offers the potential to release data that may be used to evaluate methods or perform preliminary research without risk to patient confidentiality.
Methods: We have tested five synthetic data methods:
- A technique based on Principal Component Analysis (PCA), which samples data from distributions derived from the transformed data.
- Synthetic Minority Oversampling Technique (SMOTE), which is based on interpolation between near neighbours.
- Generative Adversarial Network (GAN), an artificial neural network approach with competing networks: a discriminator network trained to distinguish between synthetic and real data, and a generator network trained to produce data that can fool the discriminator network.
- CT-GAN, a refinement of GANs specifically for the production of structured tabular synthetic data.
- Variational Auto Encoder (VAE), a method of encoding data in a reduced number of dimensions and sampling from distributions based on the encoded dimensions.
Two data sets are used to evaluate the methods:
- The Wisconsin Breast Cancer data set, a histology data set where all features are continuous variables.
- A stroke thrombolysis pathway data set, describing characteristics of patients for whom a decision is made whether to treat with clot-busting medication. Features are mostly categorical, binary, or integers.
Methods are evaluated in three ways:
- The ability of synthetic data to train a logistic regression classification model.
- A comparison of means and standard deviations between original and synthetic data.
- A comparison of covariance between features in the original and synthetic data.
Results: Using the Wisconsin Breast Cancer data set, the original data gave 98% accuracy in a logistic regression classification model. Synthetic data sets gave between 93% and 99% accuracy. Performance (best to worst) was SMOTE > PCA > GAN > CT-GAN = VAE. All methods reproduced the original data means and standard deviations with high accuracy (R-squared > 0.96 for all methods and data classes). CT-GAN and VAE suffered a significant loss of covariance between features in the synthetic data sets. Using the stroke pathway data set, the original data gave 82% accuracy in a logistic regression classification model. Synthetic data sets gave between 66% and 82% accuracy. Performance (best to worst) was SMOTE > PCA > CT-GAN > GAN > VAE. CT-GAN and VAE suffered loss of covariance between features in the synthetic data sets, though less pronounced than with the Wisconsin Breast Cancer data set.
Conclusions: The pilot work described here shows, as proof of concept, that synthetic data may be produced which is of sufficient quality to publish alongside open methodology, allowing people to better understand and test that methodology. The quality of the synthetic data also gives promise of data sets that may be used for screening of ideas, or for research projects (perhaps especially in an education setting). More work is required to further refine and test the methods across a broader range of patient-level data sets.
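For the PCA-based method and the first of the three evaluations, a minimal sketch might look like the following; sampling each principal component independently from a fitted normal distribution is an illustrative simplification, not the authors' exact procedure.

```python
# PCA-based synthetic data generation plus a logistic-regression utility check.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # Wisconsin Breast Cancer data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
synth_X, synth_y = [], []
for label in (0, 1):                          # synthesize each class separately
    scaler = StandardScaler().fit(X_tr[y_tr == label])
    pca = PCA().fit(scaler.transform(X_tr[y_tr == label]))
    scores = pca.transform(scaler.transform(X_tr[y_tr == label]))
    samples = rng.normal(scores.mean(0), scores.std(0), size=scores.shape)
    synth_X.append(scaler.inverse_transform(pca.inverse_transform(samples)))
    synth_y.append(np.full(len(samples), label))
synth_X, synth_y = np.vstack(synth_X), np.concatenate(synth_y)

real = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)
synth = LogisticRegression(max_iter=5000).fit(synth_X, synth_y).score(X_te, y_te)
print(f"real-trained accuracy {real:.2f}, synthetic-trained accuracy {synth:.2f}")
```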


Author(s):  
Cansu Görürgöz ◽  
Kaan Orhan ◽  
Ibrahim Sevki Bayrakdar ◽  
Özer Çelik ◽  
Elif Bilgir ◽  
...  

Objectives: The present study aimed to evaluate the performance of a Faster Region-based Convolutional Neural Network (R-CNN) algorithm for tooth detection and numbering on periapical images. Methods: A data set of 1686 randomly selected periapical radiographs of patients was collected retrospectively. A pre-trained model (GoogLeNet Inception v3 CNN) was employed for pre-processing, and transfer learning techniques were applied for training on the data set. The algorithm consisted of: (1) the jaw classification model, (2) region detection models, and (3) the final algorithm using all models. Finally, an analysis of the latest model was integrated alongside the others. The sensitivity, precision, true-positive rate, and false-positive/negative rates were computed to analyse the performance of the algorithm using a confusion matrix. Results: An artificial intelligence algorithm (CranioCatch, Eskisehir, Turkey) was designed based on R-CNN Inception architecture to automatically detect and number the teeth on periapical images. Of 864 teeth in 156 periapical radiographs in the test data set, 668 were correctly numbered. The F1 score, precision, and sensitivity were 0.8720, 0.7812, and 0.9867, respectively. Conclusion: The study demonstrated the potential accuracy and efficiency of the CNN algorithm for detecting and numbering teeth. Deep learning-based methods can help clinicians reduce workloads, improve dental records, and reduce turnaround time for urgent cases. This architecture might also contribute to forensic science.
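As a worked check of the reported metrics, the F1 score is the harmonic mean of precision and sensitivity, consistent with the values quoted in the abstract.

```python
# F1 as the harmonic mean of the reported precision and sensitivity.
precision, sensitivity = 0.7812, 0.9867
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(round(f1, 4))   # ~0.872, matching the reported F1 of 0.8720
```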


2016 ◽  
Vol 25 (01) ◽  
pp. 1550028 ◽  
Author(s):  
Mete Celik ◽  
Fehim Koylu ◽  
Dervis Karaboga

In data mining, classification rule learning extracts knowledge in the form of IF-THEN rules, which are comprehensible and readable. It is a challenging problem due to the complexity of data sets. Various meta-heuristic machine learning algorithms have been proposed for rule learning. Cooperative rule learning is the process of discovering all classification rules concurrently in a single run. In this paper, a novel cooperative rule learning algorithm based on the Artificial Bee Colony algorithm, called CoABCMiner, is introduced. The proposed algorithm handles the training data set and discovers a classification model containing the rule list. Token competition, a new updating strategy used in the onlooker and employed bee phases, and a new scout bee mechanism are proposed in CoABCMiner to achieve cooperative learning of different rules belonging to different classes. We compared the results of CoABCMiner with several state-of-the-art algorithms using 14 benchmark data sets. Nonparametric statistical tests, such as the Friedman test, post hoc tests, and contrast estimation based on medians, were performed; these tests assess how the control algorithm compares with the other algorithms across multiple problems. A sensitivity analysis of CoABCMiner was also conducted. It is concluded that CoABCMiner can efficiently discover classification rules for the data sets used in the experiments.
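Token competition can be illustrated in isolation as follows; this is a simplified sketch of that single step with toy rules and data, not the full CoABCMiner algorithm.

```python
# Simplified token competition: rules, ordered by quality, compete for training
# instances; a rule keeps only instances not already claimed by a better rule,
# and rules that claim nothing are removed from the rule list.
def token_competition(rules, instances):
    """rules: list of (quality, predicate) pairs; instances: list of records."""
    claimed = set()
    kept = []
    for quality, predicate in sorted(rules, key=lambda r: -r[0]):
        tokens = {i for i, rec in enumerate(instances)
                  if i not in claimed and predicate(rec)}
        if tokens:                      # the rule wins at least one token: keep it
            claimed |= tokens
            kept.append((quality, predicate))
    return kept

# Toy usage: the weaker, fully redundant rule is dropped.
data = list(range(10))
rules = [(0.9, lambda x: x < 5), (0.4, lambda x: x < 3)]
print(len(token_competition(rules, data)))   # -> 1
```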

