Machine Learning for Organic Cage Property Prediction

10.26434/chemrxiv.6995018.v2 ◽

2018 ◽

Author(s):

Lukas Turcani ◽

Rebecca L. Greenaway ◽

Kim Jelfs

Keyword(s):

Machine Learning ◽

Open Source ◽

Data Sets ◽

Cavity Size ◽

Learning Models ◽

Property Prediction ◽

Online Tool ◽

Machine Learning Models

We use machine learning to predict shape persistence and cavity size in porous organic cages. The majority of hypothetical organic cages suffer from a lack of shape persistence and as a result lack intrinsic porosity, rendering them unsuitable for many applications. We have created the largest computational database of these molecules to date, numbering 63,472 cages, formed through a range of reaction chemistries and in multiple topologies. We study our database and identify features which lead to the formation of shape persistent cages. We find that the imine condensation of trialdehydes and diamines in a [4+6] reaction is the most likely to result in shape persistent cages, whereas thiol reactions are most likely to give collapsed cages. Using this database, we develop machine learning models capable of predicting shape persistence with an accuracy of up to 93%, reducing the time taken to predict this property to milliseconds, and removing the need for specialist software. In addition, we develop machine learning models for two other key properties of these molecules, cavity size and symmetry. We provide open-source implementations of our models, together with the accompanying data sets, and an online tool giving users access to our models to easily obtain predictions for a hypothetical cage prior to a synthesis attempt.

Machine Learning for Organic Cage Property Prediction

10.26434/chemrxiv.6995018.v3 ◽

2018 ◽

Author(s):

Lukas Turcani ◽

Rebecca L. Greenaway ◽

Kim Jelfs

Keyword(s):

Machine Learning ◽

Open Source ◽

Data Sets ◽

Cavity Size ◽

Learning Models ◽

Property Prediction ◽

Online Tool ◽

Machine Learning Models

We use machine learning to predict shape persistence and cavity size in porous organic cages. The majority of hypothetical organic cages suffer from a lack of shape persistence and as a result lack intrinsic porosity, rendering them unsuitable for many applications. We have created the largest computational database of these molecules to date, numbering 63,472 cages, formed through a range of reaction chemistries and in multiple topologies. We study our database and identify features which lead to the formation of shape persistent cages. We find that the imine condensation of trialdehydes and diamines in a [4+6] reaction is the most likely to result in shape persistent cages, whereas thiol reactions are most likely to give collapsed cages. Using this database, we develop machine learning models capable of predicting shape persistence with an accuracy of up to 93%, reducing the time taken to predict this property to milliseconds, and removing the need for specialist software. In addition, we develop machine learning models for two other key properties of these molecules, cavity size and symmetry. We provide open-source implementations of our models, together with the accompanying data sets, and an online tool giving users access to our models to easily obtain predictions for a hypothetical cage prior to a synthesis attempt.

Machine Learning for Organic Cage Property Prediction

10.26434/chemrxiv.6995018.v1 ◽

2018 ◽

Author(s):

Lukas Turcani ◽

Kim Jelfs

Keyword(s):

Machine Learning ◽

Open Source ◽

Data Sets ◽

Cavity Size ◽

Learning Models ◽

Property Prediction ◽

Online Tool ◽

Machine Learning Models

We use machine learning to predict shape persistence and cavity size in porous organic cages. The majority of hypothetical organic cages suffer from a lack of shape persistence and as a result lack intrinsic porosity, rendering them unsuitable for many applications. We have created the largest computational database of these molecules to date, numbering 63,472 cages, formed through a range of reaction chemistries and in multiple topologies. We study our database and identify features which lead to the formation of shape persistent cages. We find that the imine condensation of trialdehydes and diamines in a [4+6] reaction is the most likely to result in shape persistent cages, whereas thiol reactions are most likely to give collapsed cages. Using this database, we develop machine learning models capable of predicting shape persistence with an accuracy of up to 93%, reducing the time taken to predict this property to milliseconds, and removing the need for specialist software. In addition, we develop machine learning models for two other key properties of these molecules, cavity size and symmetry. We provide open-source implementations of our models, together with the accompanying data sets, and an online tool giving users access to our models to easily obtain predictions for a hypothetical cage prior to a synthesis attempt.

End-To-End Computer Vision Framework: An Open-Source Platform for Research and Education

Sensors ◽

10.3390/s21113691 ◽

2021 ◽

Vol 21 (11) ◽

pp. 3691

Author(s):

Ciprian Orhei ◽

Silviu Vert ◽

Muguras Mocofan ◽

Radu Vasiu

Keyword(s):

Machine Learning ◽

Image Processing ◽

Computer Vision ◽

Open Source ◽

Visual Processing ◽

Research Field ◽

Learning Models ◽

Research Activity ◽

End To End ◽

Machine Learning Models

Computer Vision is a cross-disciplinary research field whose main purpose is to understand the surrounding environment as closely as possible to human perception. Image processing systems are continuously growing and expanding into more complex systems, usually tailored to the specific needs or applications they serve. To better serve this purpose, research on the architecture and design of such systems is also important. We present the End-to-End Computer Vision Framework (EECVF), an open-source solution that aims to support researchers and teachers within the vast field of image processing. The framework incorporates Computer Vision features and Machine Learning models that researchers can use. Given the continuous need to add new Computer Vision algorithms in day-to-day research activity, the proposed framework has the advantage of a configurable and scalable architecture. Although the main focus of the framework is the Computer Vision processing pipeline, it also offers ways to incorporate more complex activities, such as training Machine Learning models. EECVF aims to become a useful tool for learning activities in the Computer Vision field, as it allows the learner and the teacher to handle only the topics at hand, and not the interconnections necessary for the visual processing flow.

Machine Learning Boosted Docking (HASTEN): An Open-Source Tool To Accelerate Structure-based Virtual Screening Campaigns

10.26434/chemrxiv.14345849 ◽

2021 ◽

Author(s):

Tuomo Kalliokoski

Keyword(s):

Machine Learning ◽

Virtual Screening ◽

Open Source ◽

Learning Models ◽

Open Source Tool ◽

The Mean ◽

Machine Learning Models

The software macHine leArning booSTed dockiNg (HASTEN) was developed to accelerate structure-based virtual screening using machine learning models. It has been validated using datasets both from the literature (12 datasets, each containing three million molecules docked with FRED) and from in-house sources (one dataset of four million compounds docked with Glide). For the literature data, HASTEN showed reasonable performance, with a mean recall of 0.78 for the top-scoring one percent of molecules after docking 10% of the dataset, whereas an excellent recall of 0.95 was achieved for the in-house data. The program can be used with any docking and machine learning methodology, and is freely available from https://github.com/TuomoKalliokoski/HASTEN.
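
The recall figure quoted above can be illustrated with a small sketch. This is not code from HASTEN; the function name `top_fraction_recall`, the toy scores, and the convention that a lower docking score is better are all assumptions made for illustration.

```python
def top_fraction_recall(true_scores, selected, fraction=0.01):
    """Fraction of the best-scoring molecules that appear in `selected`.

    true_scores: docking score of every molecule in the library
                 (assumption: lower score = better).
    selected:    indices of the molecules that were actually docked.
    """
    n_top = max(1, int(len(true_scores) * fraction))
    # Rank the whole library by score and keep the best `n_top` indices.
    ranked = sorted(range(len(true_scores)), key=lambda i: true_scores[i])
    top = set(ranked[:n_top])
    return len(top & set(selected)) / n_top


# Toy library of 1000 molecules whose score equals its index,
# so the true top 1% are molecules 0..9.
scores = [float(i) for i in range(1000)]

# Hypothetical outcome: 100 molecules (10% of the library) were docked,
# and 8 of the true top 10 are among them.
docked = list(range(8)) + list(range(500, 592))

recall = top_fraction_recall(scores, docked)
print(recall)
```

In this toy case the recall is 8 out of 10, which mirrors how a value such as 0.78 would be read: the share of the true top one percent recovered after docking only a tenth of the library.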

Using Machine Learning Methods Incorporating Individual Reader Annotations to Classify Paediatric Chest Radiographs in Epidemiological Studies

Wellcome Open Research ◽

10.12688/wellcomeopenres.17164.1 ◽

2021 ◽

Vol 6 ◽

pp. 309

Author(s):

Paul Mwaniki ◽

Timothy Kamanu ◽

Samuel Akech ◽

M. J. C Eijkemans

Keyword(s):

Machine Learning ◽

Epidemiological Studies ◽

Chest Radiographs ◽

World Health ◽

Data Sets ◽

Learning Models ◽

Middle Income ◽

Training Models ◽

Model Training ◽

Machine Learning Models

Introduction: Epidemiological studies that involve interpretation of chest radiographs (CXRs) suffer from inter-reader and intra-reader variability. This variability hinders comparison of results from different studies or centres, which negatively affects efforts to track the burden of chest diseases or evaluate the efficacy of interventions such as vaccines. This study explores machine learning models that could standardize interpretation of CXRs across studies, and the utility of incorporating individual reader annotations when training models using CXR data sets annotated by multiple readers. Methods: Convolutional neural networks were used to classify CXRs from seven low- to middle-income countries into five categories according to the World Health Organization's standardized methodology for interpreting paediatric CXRs. We compared models trained to predict the final/aggregate classification with models trained to predict how each reader would classify an image and then aggregate the predictions for all readers using an unweighted mean. Results: Incorporating individual readers' annotations during model training improved classification accuracy by 3.4% (multi-class accuracy 61% vs 59%). Model accuracy was higher for children above 12 months of age (68% vs 58%). The accuracy of the models in different countries ranged between 45% and 71%. Conclusions: Machine learning models can annotate CXRs in epidemiological studies, reducing inter-reader and intra-reader variability. In addition, incorporating individual reader annotations can improve the performance of machine learning models trained using CXRs annotated by multiple readers.
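
The aggregation step described in the Methods can be sketched as follows. This is an illustrative sketch only, not the authors' code: the function name `aggregate_readers`, the example probabilities, and the choice of taking the highest mean probability as the final class are assumptions.

```python
def aggregate_readers(per_reader_probs):
    """Unweighted mean of per-reader class probabilities.

    per_reader_probs: one list of class probabilities per reader,
                      all of the same length.
    Returns the mean probabilities and the index of the largest one
    (using the largest mean as the final label is an assumption).
    """
    n_readers = len(per_reader_probs)
    n_classes = len(per_reader_probs[0])
    mean = [
        sum(reader[c] for reader in per_reader_probs) / n_readers
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: mean[c])
    return mean, label


# Hypothetical predicted probabilities for one image, for three readers
# and five classes.
probs = [
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.1, 0.1, 0.1],
    [0.5, 0.2, 0.1, 0.1, 0.1],
]

mean_probs, predicted = aggregate_readers(probs)
print(mean_probs, predicted)
```

Every reader contributes equally here, which is what "unweighted" means; a weighted variant would multiply each reader's row by a reader-specific weight before summing.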

Improving Logging Prediction on Imbalanced Datasets

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2016040103 ◽

2016 ◽

Vol 7 (2) ◽

pp. 43-71 ◽

Cited By ~ 3

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Open Source ◽

Class Imbalance ◽

Learning Model ◽

Learning Models ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Machine Learning Model ◽

Machine Learning Models

Logging is an important yet difficult decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code constructs. The prediction performance of these models is limited by the class-imbalance problem, since the number of logged code constructs is small compared to the number of non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performance of the J48, RF, and SVM classifiers in predicting logged catch-blocks and if-blocks on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, the LogIm model improves the performance of the baseline classifiers J48, RF, and SVM by 7.38%, 9.24%, and 4.6% for catch-block logging prediction, and by 12.11%, 14.95%, and 19.13% for if-block logging prediction.
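
The abstract does not describe LogIm's internals, so the following is only a generic sketch of how an ensemble combined with a decision threshold can help on imbalanced data. The function `ensemble_predict`, the probabilities, and the threshold values are hypothetical and should not be read as the LogIm algorithm.

```python
def ensemble_predict(prob_lists, threshold=0.5):
    """Average positive-class probabilities across classifiers and label
    an example positive (1) when the mean reaches `threshold`."""
    n_models = len(prob_lists)
    n_examples = len(prob_lists[0])
    means = [
        sum(model[i] for model in prob_lists) / n_models
        for i in range(n_examples)
    ]
    return [1 if m >= threshold else 0 for m in means]


# Hypothetical "logged" probabilities from three classifiers
# for four code blocks.
model_a = [0.10, 0.40, 0.70, 0.20]
model_b = [0.05, 0.35, 0.60, 0.30]
model_c = [0.15, 0.30, 0.80, 0.10]
models = [model_a, model_b, model_c]

# A lowered threshold flags more of the rare positive class.
labels_low = ensemble_predict(models, threshold=0.3)
labels_default = ensemble_predict(models, threshold=0.5)
print(labels_low, labels_default)
```

With the default threshold only the third block is labelled as logged, while the lowered threshold also catches the second block. Moving the threshold trades precision for recall on the minority class.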

Classification and Success Investigation of Biomedical Data Sets Using Supervised Machine Learning Models

2019 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) ◽

10.1109/ismsit.2019.8932734 ◽

2019 ◽

Author(s):

Sarmad N. Mohammed ◽

Mehmet Serdar Guzel ◽

Erkan Bostanci

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Data Sets ◽

Biomedical Data ◽

Learning Models ◽

Machine Learning Models

Saga: An Open Source Platform for Training Machine Learning Models and Community-driven Sharing of Techniques

2019 International Conference on Content-Based Multimedia Indexing (CBMI) ◽

10.1109/cbmi.2019.8877455 ◽

2019 ◽

Author(s):

Rune Johan Borgli ◽

Hakon Kvale Stensland ◽

Pal Halvorsen ◽

Michael Alexander Riegler

Keyword(s):

Machine Learning ◽

Open Source ◽

Learning Models ◽

Machine Learning Models

Arangopipe, a tool for machine learning meta-data management

Data Science ◽

10.3233/ds-210034 ◽

2021 ◽

pp. 1-15

Author(s):

Jörg Schad ◽

Rajiv Sambasivan ◽

Christopher Woodward

Keyword(s):

Machine Learning ◽

Life Cycle ◽

Open Source ◽

Data Model ◽

Application Programming Interface ◽

Learning Models ◽

Essential Components ◽

Application Programming ◽

Programming Interface ◽

Machine Learning Models

Experimenting with different models, documenting results and findings, and repeating these tasks are day-to-day activities for machine learning engineers and data scientists. There is a need to keep control of the machine-learning pipeline and its metadata. This allows users to iterate quickly through experiments and retrieve key findings and observations from historical activity. This is the need that Arangopipe serves. Arangopipe is an open-source tool that provides a data model that captures the essential components of any machine learning life cycle. Arangopipe provides an application programming interface that permits machine-learning engineers to record the details of the salient steps in building their machine learning models. The components of the data model and an overview of the application programming interface are provided. Illustrative examples of basic and advanced machine learning workflows are also provided. Arangopipe is useful not only for users involved in developing machine learning models but also for users deploying and maintaining them.
