The use of machine learning methods in the development of nasal dosage forms with cerebroprotective action

To conserve active pharmaceutical ingredients and excipients, it is advisable at the early, experiment-planning stage of research to use predicted and experimental physicochemical property data stored in aggregation databases. The information found shortens composition and process development. However, these services do not always reflect the full variety of characteristics of active compounds and excipients. Machine learning models are now widely used in various scientific fields and can yield highly reliable predictions. It is therefore relevant and promising to develop machine learning models that predict pharmaceutical incompatibilities in the formulation of nasal dosage forms. The aim of the study is to develop machine learning models for in silico prediction of the rational composition of nasal dosage forms with cerebroprotective action. Materials and methods. The material was a dataset of compounds (active and auxiliary) labeled with the presence or absence of an interaction (pharmaceutical incompatibility). Training datasets were compiled manually by content analysis of the PubMed library (pubmed.ncbi.nlm.nih.gov) for the last 10 years, using the keywords “pharmaceutical incompatibilities”, “physico-chemical compatibility”, and “incompatible excipients”. The resulting dataset comprises 1185 rows. Binary classification models were built with the PyCaret machine learning library (pycaret.org) in Python 3.8 (python.org), within the Miniconda package management environment (conda.io). The pipeline was programmed in Jupyter Notebook (jupyter.org). MACCS (Molecular ACCess System) keys for the training dataset were generated with the RDKit package (rdkit.org). SMILES (simplified molecular-input line-entry system) strings were retrieved automatically from the PubChem service (pubchem.ncbi.nlm.nih.gov). Results. The data obtained made it possible to select two promising binary classification models, whose quality was checked on a verification dataset. Statistical evaluation of the selected models indicates that the presence or absence of pharmaceutical incompatibilities in the development of nasal cerebroprotective formulations can be predicted in silico with high probability. The models are available on the web server of the expert system ExpSys Nasalia (nasalia.zsmu.zp.ua) in the calculations section. Conclusions. We have developed machine learning models for in silico prediction of the rational composition of nasal dosage forms with cerebroprotective action. The quality of the pharmaceutical incompatibility predictions made by the developed models was confirmed on a verification dataset. The following statistical indicators were obtained: tree_blender (AUC 0.9521, F1 0.9747, MCC 0.9094) and boost_blender (AUC 0.9593, F1 0.9821, MCC 0.9352). The use of machine learning models in pharmaceutical development will contribute to resource conservation and optimization of formulation composition.
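
The F1 and MCC values reported for the two models are standard functions of a binary confusion matrix. The following is a minimal Python sketch of those two formulas, not the authors' evaluation code; the function names are illustrative and the counts are hypothetical, not taken from the study.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion-matrix counts (NOT from the study).
tp, tn, fp, fn = 90, 40, 5, 3
print(f"F1  = {f1_score(tp, fp, fn):.4f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

MCC uses all four cells of the confusion matrix, so it is less flattering than F1 when the negative class is small or poorly predicted.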

Cocrystal Prediction Using Machine Learning Models and Descriptors

Applied Sciences, DOI 10.3390/app11031323, 2021, Vol 11 (3), pp. 1323

Author(s): Medard Edmund Mswahili, Min-Jeong Lee, Gati Lother Martin, Junghyun Kim, Paul Kim, ...

Keyword(s): Machine Learning, Academic Research, Pharmaceutical Research, Machine Learning Techniques, Learning Models, Pharmaceutical Ingredients, Learning Techniques, Comparable Performance, Selection Algorithms, Machine Learning Models

Cocrystals are of great interest in both industrial applications and academic research, and screening suitable coformers for active pharmaceutical ingredients is the most crucial and challenging step in cocrystal development. Machine learning techniques have recently attracted researchers in many fields, including pharmaceutical research, for tasks such as quantitative structure-activity/property relationship modeling. In this paper, we develop machine learning models to predict cocrystal formation. We extract descriptor values from the simplified molecular-input line-entry system (SMILES) representations of compounds and compare the machine learning models in experiments on our collected data of 1476 instances. We found that the artificial neural network shows great potential, as it has the best accuracy, sensitivity, and F1 score. We also found that the model achieved comparable performance with about half of the descriptors, chosen by feature selection algorithms. We believe that this will contribute to faster and more accurate cocrystal development.

Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis

Diagnostics, DOI 10.3390/diagnostics12010040, 2021, Vol 12 (1), pp. 40

Author(s): Meike Nauta, Ricky Walsh, Adam Dubowski, Christin Seifert

Keyword(s): Machine Learning, Clinical Practice, Skin Cancer, Cancer Diagnosis, Image Inpainting, Relevant Information, Black Box, Training Dataset, Learning Models, Machine Learning Models

Machine learning models have been successfully applied for analysis of skin images. However, due to the black box nature of such deep learning models, it is difficult to understand their underlying reasoning. This prevents a human from validating whether the model is right for the right reasons. Spurious correlations and other biases in data can cause a model to base its predictions on such artefacts rather than on the true relevant information. These learned shortcuts can in turn cause incorrect performance estimates and can result in unexpected outcomes when the model is applied in clinical practice. This study presents a method to detect and quantify this shortcut learning in trained classifiers for skin cancer diagnosis, since it is known that dermoscopy images can contain artefacts. Specifically, we train a standard VGG16-based skin cancer classifier on the public ISIC dataset, for which colour calibration charts (elliptical, coloured patches) occur only in benign images and not in malignant ones. Our methodology artificially inserts those patches and uses inpainting to automatically remove patches from images to assess the changes in predictions. We find that our standard classifier partly bases its predictions of benign images on the presence of such a coloured patch. More importantly, by artificially inserting coloured patches into malignant images, we show that shortcut learning results in a significant increase in misdiagnoses, making the classifier unreliable when used in clinical practice. With our results, we, therefore, want to increase awareness of the risks of using black box machine learning models trained on potentially biased datasets. Finally, we present a model-agnostic method to neutralise shortcut learning by removing the bias in the training dataset by exchanging coloured patches with benign skin tissue using image inpainting and re-training the classifier on this de-biased dataset.
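
One way to put a number on shortcut learning of the kind described above is to measure how often a prediction flips when the suspected artefact is inserted. The Python sketch below is only an illustration under stated assumptions: `flip_rate`, `shortcut_classifier` and `insert_patch` are hypothetical names, and an "image" is a toy dict rather than pixel data or a VGG16 input.

```python
from typing import Callable, Sequence


def flip_rate(
    predict: Callable[[dict], str],
    images: Sequence[dict],
    perturb: Callable[[dict], dict],
    from_label: str,
    to_label: str,
) -> float:
    """Fraction of images predicted `from_label` that become `to_label`
    once `perturb` (e.g. inserting a coloured patch) is applied."""
    affected = [img for img in images if predict(img) == from_label]
    if not affected:
        return 0.0
    flipped = sum(1 for img in affected if predict(perturb(img)) == to_label)
    return flipped / len(affected)


# Toy classifier (hypothetical) that has learned the shortcut:
# any image with a patch is called benign.
def shortcut_classifier(image: dict) -> str:
    if image["has_patch"]:
        return "benign"
    return "malignant" if image["lesion_score"] >= 0.5 else "benign"


# Toy perturbation (hypothetical): mark the image as containing a patch.
def insert_patch(image: dict) -> dict:
    return {**image, "has_patch": True}


images = [
    {"lesion_score": 0.9, "has_patch": False},
    {"lesion_score": 0.8, "has_patch": False},
    {"lesion_score": 0.2, "has_patch": False},
]
rate = flip_rate(shortcut_classifier, images, insert_patch, "malignant", "benign")
print(f"malignant -> benign after patch insertion: {rate:.0%}")
```

A classifier that ignores the patch would score 0 on this measure, so a high flip rate points at the artefact rather than the lesion as the basis for the prediction.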

Prediction of Oral Bioavailability in Rats: Transferring Insights from in Vitro Correlations to (Deep) Machine Learning Models Using in Silico Model Outputs and Chemical Structure Parameters

Journal of Chemical Information and Modeling, DOI 10.1021/acs.jcim.9b00460, 2019, Vol 59 (11), pp. 4893-4905, Cited By ~ 7

Author(s): Sebastian Schneckener, Sergio Grimbs, Jessica Hey, Stephan Menz, Maren Osmers, ...

Keyword(s): Machine Learning, Oral Bioavailability, In Silico, Chemical Structure, Structure Parameters, Learning Models, In Silico Model, Machine Learning Models

Application of Natural Language Processing with Supervised Machine Learning Techniques to Predict the Overall Drugs Performance

AJIT-e Online Academic Journal of Information Technology, DOI 10.5824/ajite.2020.01.001.x, 2020, Vol 11 (40), pp. 8-23

Author(s): Pius MARTHIN, Duygu İÇEN

Keyword(s): Machine Learning, Support Vector Machine, Random Forest, Semantic Analysis, Classification Tree, Supervised Machine Learning, Training Dataset, Support Vector, Learning Models, Machine Learning Models

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. Given the wealth of information on users' satisfaction and experiences with a particular drug, pharmaceutical companies use online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models that facilitate decision making in various fields. In this manuscript, we used the drug review dataset used by Gräßer, Kallumadi, Malberg, & Zaunseder (2018), freely available from the machine learning repository of the University of California Irvine (UCI), to identify the machine learning model that best predicts overall drug performance from users' reviews. In addition to several manipulations done to improve model accuracy, all necessary text analysis procedures were followed, including text cleaning and transformation of texts to numeric format for training machine learning models. Prior to modeling, we obtained overall sentiment scores for the reviews. Customers' reviews were summarized and visualized using a bar plot and a word cloud to explore the most frequent terms. Due to scalability issues, we were able to use only a sample of the dataset: we randomly sampled 15000 observations from the 161297 in the training dataset and 10000 observations from the 53766 in the testing dataset. Several machine learning models were trained using 10-fold cross-validation performed under stratified random sampling. The trained models include Classification and Regression Trees (CART), a classification tree by C5.0, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), support vector machines (SVM) with both radial and linear kernels, and a random forest of classification trees (Random Forest). Model selection was done by comparing accuracy and computational efficiency. The support vector machine (SVM) with linear kernel performed significantly better than the rest, with an accuracy of 83%. Using only a small portion of the dataset, we attained reasonable accuracy in our models by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).
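
The TF-IDF transformation mentioned in this abstract can be illustrated in a few lines of Python. This is a sketch of one common variant (term frequency as count divided by document length, inverse document frequency as log(N/df)), not the weighting the authors necessarily used; the function name and the toy reviews are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical tokenized reviews (NOT from the UCI dataset).
reviews = [
    ["good", "drug", "no", "side", "effects"],
    ["bad", "drug", "severe", "side", "effects"],
    ["good", "relief"],
]
for review, w in zip(reviews, tf_idf(reviews)):
    print(review, {t: round(v, 3) for t, v in w.items()})
```

Terms that occur in every document get weight 0 under this variant, which is why uninformative words drop out before a step such as LSA is applied.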

Benchmarking machine learning models for the analysis of genetic data using FRESA.CAD Binary Classification Benchmarking

DOI 10.1101/733675, 2019

Author(s): Javier de Velasco Oriol, Antonio Martinez-Torteya, Victor Trevino, Israel Alanis, Edgar E. Vallejo, ...

Keyword(s): Machine Learning, Model Selection, Binary Classification, Genetic Data, R Package, Learning Models, Classification Problems, Machine Learning Methods, Computational Perspective, Machine Learning Models

Background. Machine learning models have proven to be useful tools for the analysis of genetic data. However, with the availability of a wide variety of such methods, model selection has become increasingly difficult, both from the human and the computational perspective. Results. We present the R package FRESA.CAD Binary Classification Benchmarking, which performs systematic comparisons between a collection of representative machine learning methods for solving binary classification problems on genetic datasets. Conclusions. FRESA.CAD Binary Benchmarking proves to be a useful tool across a variety of binary classification problems involving the analysis of genetic data, showing both quantitative and qualitative advantages over similar packages.

Explaining and avoiding failures modes in goal-directed generation

DOI 10.33774/chemrxiv-2021-4m6b3-v2, 2021

Author(s): Maxime Langevin, Rodolphe Vuilleumier, Marc Bianciotto

Keyword(s): Machine Learning, Predictive Models, Optimization Model, In Silico, Molecular Design, Data Distribution, Learning Models, Control Models, Machine Learning Models

Despite growing interest and success in automated in silico molecular design, doubts remain regarding the ability of goal-directed generation algorithms to perform unbiased exploration of novel chemical spaces. A specific phenomenon has recently been highlighted: goal-directed generation guided by machine learning models produces molecules with high scores according to the optimization model, but low scores according to control models, even when these are trained on the same data distribution and the same target. In this work, we show that this worrisome behavior is actually due to issues with the predictive models and not the goal-directed generation algorithms. We show that with appropriate predictive models, this issue can be resolved, and the molecules generated have high scores according to both the optimization and the control models.

Predicting Anesthetic Infusion Events Using Machine Learning

DOI 10.21203/rs.3.rs-783161/v1, 2021

Author(s): Naoki Miyaguchi, Koh Takeuchi, Hisashi Kashima, Mizuki Morita, Hiroshi Morimatsu

Keyword(s): Machine Learning, Flow Rate, Short Term Memory, Binary Classification, Classification Problem, Clinical Findings, Support Vector, Learning Models, Continuous Administration, Machine Learning Models

Research has recently been conducted on automatically controlling anesthesia using machine learning, with the aim of alleviating the shortage of anesthesiologists. In this study, we address the problem of predicting decisions made by anesthesiologists during surgery using machine learning; specifically, we formulate the decision of whether to increase the flow rate at each time point in the continuous administration of the analgesic remifentanil as a supervised binary classification problem. Experiments were conducted to evaluate the prediction performance of six machine learning models (logistic regression, support vector machine, random forest, LightGBM, artificial neural network, and long short-term memory (LSTM)) using data from 210 cases collected during actual surgeries. The results demonstrated that, when predicting an increase in the flow rate of remifentanil 1 min ahead, the LSTM model achieved scores of 0.659 for sensitivity, 0.732 for specificity, and 0.753 for ROC-AUC; this demonstrates the potential to predict the decisions made by anesthesiologists using machine learning. Furthermore, we examined the importance and contribution of the features of each model using Shapley additive explanations, a method for interpreting predictions made by machine learning models. The trends indicated by the results were partially consistent with known clinical findings.
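
The three scores quoted in this abstract (sensitivity, specificity, ROC-AUC) can be computed from labels and model outputs as sketched below in Python. The function names and sample values are hypothetical; ROC-AUC is computed here by direct pairwise comparison, which is simple but quadratic in the number of samples.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    labels: Sequence[int], preds: Sequence[int]
) -> Tuple[float, float]:
    """Return (TP / (TP + FN), TN / (TN + FP)) for 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = flow rate increased, 0 = not increased.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
print(sensitivity_specificity(labels, preds), roc_auc(labels, scores))
```

Sensitivity and specificity depend on the chosen decision threshold, whereas ROC-AUC summarizes ranking quality over all thresholds.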

Detecting Arsenic Contamination Using Satellite Imagery and Machine Learning

Toxics, DOI 10.3390/toxics9120333, 2021, Vol 9 (12), pp. 333

Author(s): Ayush Agrawal, Mark R. Petersen

Keyword(s): Machine Learning, Data Augmentation, Mean Squared Error, Binary Classification, Arsenic Concentration, Arsenic Contamination, Hyperspectral Data, Detection Methods, Learning Models, Machine Learning Models

Arsenic, a potent carcinogen and neurotoxin, affects over 200 million people globally. Current detection methods are laborious, expensive, and unscalable, and are difficult to implement in developing regions and during crises such as COVID-19. This study attempts to determine if a relationship exists between soil’s hyperspectral data and arsenic concentration using NASA’s Hyperion satellite. It is the first arsenic study to use satellite-based hyperspectral data and apply a classification approach. Four regression machine learning models are tested to determine this correlation in soil with bare land cover. Raw data are converted to reflectance, problematic atmospheric influences are removed, characteristic wavelengths are selected, and four noise reduction algorithms are tested. The combination of data augmentation, Genetic Algorithm, Second Derivative Transformation, and Random Forest regression (R² = 0.840 and normalized root mean squared error (re-scaled to [0,1]) = 0.122) shows strong correlation, performing better than past models despite using noisier satellite data (versus lab-processed samples). Three binary classification machine learning models are then applied to identify high-risk shrub-covered regions in ten U.S. states, achieving strong accuracy (0.693) and F1-score (0.728). Overall, these results suggest that such a methodology is practical and can provide a sustainable alternative to arsenic contamination detection.
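
The two regression scores quoted in this abstract can be sketched in Python as follows. The abstract does not say how the root mean squared error was normalized, so dividing by the range of the observed values is an assumption made here; the function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def nrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """RMSE divided by the observed range (ASSUMED normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


# Hypothetical arsenic concentrations (NOT from the study).
observed = [2.0, 5.0, 9.0, 14.0]
predicted = [3.0, 4.5, 10.0, 12.5]
print(f"R^2   = {r_squared(observed, predicted):.3f}")
print(f"NRMSE = {nrmse(observed, predicted):.3f}")
```

Both functions assume the observed values are not all identical, since the total sum of squares and the range would then be zero.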

The Adequacy Assessment of Test Sets in Machine Learning using Mutation Testing

International Journal of Engineering and Advanced Technology - Regular Issue, DOI 10.35940/ijeat.a1183.109119, 2019, Vol 9 (1), pp. 4390-4395

Keyword(s): Machine Learning, Research Question, Training Dataset, Data Repository, Learning Models, Data Set, Test Dataset, Test Sets, Adequacy Assessment, Machine Learning Models

Accuracy is computed by applying the test dataset to a model that has been trained on the training dataset. Thus, the test dataset in machine learning is expected to validate whether a trained model is sufficiently accurate for use. This study addresses this issue in the form of the research question, “how adequate is the test dataset used in machine learning models to validate the models?” To answer this question, the study takes the seven most popular datasets registered in the UCI machine learning data repository and applies them to six different machine learning models. We conduct an empirical study of how adequate the test sets used in validating machine learning models are. The testing adequacy for each model and each dataset is analyzed using the mutation analysis technique.
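
The idea of judging a test set by mutation analysis can be sketched in Python as a mutation score: the share of deliberately altered models ("mutants") that the test set can tell apart from the original. The kill criterion used here (any differing prediction) and all names are assumptions for illustration; the abstract does not specify the study's mutation operators or criterion.

```python
from typing import Callable, Sequence


def mutation_score(
    original: Callable[[float], bool],
    mutants: Sequence[Callable[[float], bool]],
    test_inputs: Sequence[float],
) -> float:
    """Fraction of mutants 'killed', i.e. whose prediction differs from the
    original model on at least one test input (ASSUMED kill criterion)."""
    if not mutants:
        return 0.0
    killed = sum(
        1
        for mutant in mutants
        if any(mutant(x) != original(x) for x in test_inputs)
    )
    return killed / len(mutants)


# Toy threshold "model" and three hypothetical mutants of it.
def model(x: float) -> bool:
    return x >= 5


def mutant_strict(x: float) -> bool:
    return x > 5  # boundary operator changed


def mutant_shifted(x: float) -> bool:
    return x >= 6  # threshold shifted by one


def mutant_far(x: float) -> bool:
    return x >= 50  # threshold shifted far away


mutants = [mutant_strict, mutant_shifted, mutant_far]
weak_tests = [1, 9]
better_tests = [1, 5, 9]  # adds the boundary case
print(mutation_score(model, mutants, weak_tests))
print(mutation_score(model, mutants, better_tests))
```

In this toy case the test set without the boundary value lets two of the three mutants survive, which is the sense in which a low mutation score signals an inadequate test set.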

An Explainable Machine Learning Model for Material Backorder Prediction in Inventory Management

Sensors, DOI 10.3390/s21237926, 2021, Vol 21 (23), pp. 7926

Author(s): Charis Ntakolia, Christos Kokkotis, Patrik Karlsson, Serafeim Moustakidis

Keyword(s): Machine Learning, Supply Chain, Inventory Management, Historical Data, Binary Classification, Production Costs, Correct Prediction, Learning Models, Future Production, Machine Learning Models

Global competition among businesses demands a more effective, low-cost supply chain that allows firms to provide products at the desired quality, quantity, and time, with lower production costs. The latter include holding cost, ordering cost, and backorder cost. A backorder occurs when a product is temporarily unavailable or out of stock and the customer places an order for future production and shipment. Stock unavailability and prolonged delays in product delivery therefore lead to additional production costs and unsatisfied customers, respectively. Thus, it is highly important to develop models that effectively predict the backorder rate in an inventory system, with the aim of improving the effectiveness of the supply chain and, consequently, the performance of the company. However, traditional approaches in the literature are based on stochastic approximation and do not incorporate information from historical data. To this end, machine learning models should be employed to extract knowledge from large historical data and develop predictive models. To cover this need, this study addressed the backorder prediction problem. Specifically, various machine learning models were compared on the binary classification problem of backorder prediction, followed by model calibration and a post-hoc explainability analysis based on the SHAP model to identify and interpret the most important features that contribute to material backorder. The results showed that the RF, XGB, LGBM, and BB models reached an AUC score of 0.95, while the best-performing model was the LGBM model after calibration with the Isotonic Regression method. The explainability analysis showed that the inventory stock of a product, the volume of products that can be delivered, the imminent demand (sales), and the accurate prediction of future demand can significantly contribute to the correct prediction of backorders.
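
Calibration with isotonic regression, as mentioned in this abstract, fits a non-decreasing mapping from raw classifier scores to observed outcome rates. The Python sketch below uses the pool-adjacent-violators algorithm and predicts with a step function; it is an illustration rather than the authors' pipeline, library implementations may interpolate between points instead, and the function names and data are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import List, Sequence, Tuple

Model = List[Tuple[float, float]]  # (upper score bound, calibrated value)


def isotonic_fit(scores: Sequence[float], labels: Sequence[int]) -> Model:
    """Pool-adjacent-violators: returns blocks with non-decreasing values."""
    # Pool tied scores first so each distinct score forms one starting block.
    grouped = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        grouped[s][0] += y
        grouped[s][1] += 1
    blocks: List[list] = []  # each block: [label sum, count, upper score]
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(b[2], b[0] / b[1]) for b in blocks]


def isotonic_predict(model: Model, score: float) -> float:
    """Step-function lookup: value of the first block whose bound >= score."""
    uppers = [upper for upper, _ in model]
    i = bisect.bisect_left(uppers, score)
    return model[min(i, len(model) - 1)][1]


# Hypothetical raw scores and backorder labels (NOT from the study).
raw_scores = [0.1, 0.2, 0.3, 0.4]
outcomes = [0, 1, 0, 1]
calibrator = isotonic_fit(raw_scores, outcomes)
print(calibrator)
print(isotonic_predict(calibrator, 0.25))
```

Because the fitted mapping is monotone, calibration of this kind changes the predicted probabilities without reordering the samples, so a ranking metric such as AUC is affected only through newly created ties.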
