A graph neural network approach to molecule carcinogenicity prediction

Molecular carcinogenicity is a preventable cause of cancer; however, most experimental testing of molecular compounds is expensive and time-consuming, making high-throughput experimental approaches infeasible. In recent years, there has been substantial progress in machine learning techniques for molecular property prediction. In this work, we propose a model for carcinogenicity prediction, CONCERTO, which uses a graph transformer in conjunction with a molecular fingerprint representation, trained on multi-round mutagenicity and carcinogenicity objectives. To train and validate CONCERTO, we augment the training dataset with more informative labels and utilize a larger external validation dataset. Extensive experiments demonstrate that our model yields results superior to alternative approaches for molecular carcinogenicity prediction.

Determining the extent and drivers of attrition losses from wind using long-term datasets and machine learning techniques

Forestry An International Journal of Forest Research ◽

10.1093/forestry/cpy047 ◽

2019 ◽

Vol 92 (4) ◽

pp. 425-435 ◽

Cited By ~ 2

Author(s):

John Moore ◽

Yue Lin

Keyword(s):

Machine Learning ◽

Basal Area ◽

Wind Damage ◽

Machine Learning Techniques ◽

Training Dataset ◽

Validation Dataset ◽

Gradient Boosting ◽

Factors Associated ◽

Learning Techniques

In addition to causing large-scale catastrophic damage to forests, wind can also cause damage to individual trees or small groups of trees. Over time, the cumulative effect of this wind-induced attrition can result in a significant reduction in yield in managed forests. Better understanding of the extent of these losses and the factors associated with them can aid better forest management. Information on wind damage attrition is often captured in long-term growth monitoring plots, but analysing these large datasets to identify factors associated with the damage can be problematic. Machine learning techniques offer the potential to overcome some of the challenges with analysing these datasets. In this study, we applied two commonly available machine learning algorithms (Random Forests and Gradient Boosting Trees) to a large, long-term dataset of tree growth for radiata pine (Pinus radiata D. Don) in New Zealand containing more than 157 000 observations. Both algorithms identified stand density and height-to-diameter ratio as the two most important variables associated with the proportion of basal area lost to wind. The algorithms differed in their ease of parameterization and processing time as well as their overall ability to predict wind damage loss. The Random Forest model was able to predict ~43 per cent of the variation in the proportion of basal area lost to wind damage in the training dataset (a random sample of 80 per cent of the original data) and 45 per cent of the variation in the validation dataset (the remaining 20 per cent of the data). Conversely, the Gradient Boosting Tree model was able to predict more than 99 per cent of the variation in wind damage loss in the training dataset, but only ~49 per cent of the variation in the validation dataset, which highlights the potential for overfitting models to specific datasets. When applying these techniques to long-term datasets, it is also important to be aware of potential issues with the underlying data, such as missing observations resulting from plots being abandoned without measurement when damage levels have been very high.
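
The 80/20 split and the "variation explained" figures in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`train_validation_split`, `r_squared`, `overfitting_gap`) are hypothetical, and it assumes that "variation explained" means the usual coefficient of determination (R²). The gap figures plug in the percentages quoted in the abstract, taking "more than 99 per cent" as 0.99.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def r_squared(observed, predicted):
    """Fraction of the variance in `observed` that `predicted` accounts for."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


def overfitting_gap(r2_train, r2_validation):
    """Training R² minus validation R²; a large positive gap suggests overfitting."""
    return r2_train - r2_validation


# Split 100 placeholder observations 80/20, as described in the abstract.
train, validation = train_validation_split(range(100))

# R² values quoted in the abstract (approximate).
gap_random_forest = overfitting_gap(0.43, 0.45)      # slightly negative
gap_gradient_boosting = overfitting_gap(0.99, 0.49)  # about 0.50
```

A gap near zero, as for the Random Forest figures, indicates similar performance on seen and unseen data; the gap of roughly 0.5 for Gradient Boosting is the overfitting signal the abstract points to.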

Discovery of Highly Polymorphic Organic Materials: A New Machine Learning Approach

10.26434/chemrxiv.9524219 ◽

2019 ◽

Author(s):

Zied Hosni ◽

Annalisa Riccardi ◽

Stephanie Yerdelen ◽

Alan R. G. Martin ◽

Deborah Bowering ◽

...

Keyword(s):

Machine Learning ◽

Structure Prediction ◽

External Validation ◽

New Drugs ◽

Training Dataset ◽

Validation Dataset ◽

Machine Learning Classification ◽

Novel Approach ◽

Physical Form ◽

Machine Learning Approach

Polymorphism is the capacity of a molecule to adopt different conformations or molecular packing arrangements in the solid state. This is a key property to control during pharmaceutical manufacturing because it can impact a range of properties including stability and solubility. In this study, a novel approach based on machine learning classification methods is used to predict the likelihood for an organic compound to crystallise in multiple forms. A training dataset of drug-like molecules was curated from the Cambridge Structural Database (CSD) and filtered according to entries in the Drug Bank database. The number of separate forms in the CSD for each molecule was recorded. A metaclassifier was trained using this dataset to predict the expected number of crystalline forms from the compound descriptors. This approach was used to estimate the number of crystallographic forms for an external validation dataset. These results suggest this novel methodology can be used to predict the extent of polymorphism of new drugs or not-yet experimentally screened molecules. This promising method complements expensive ab initio methods for crystal structure prediction and, as an integral part of experimental physical form screening, may identify systems with unexplored potential.

Intelligent Neural Network Schemes for Multi-Class Classification

Applied Sciences ◽

10.3390/app9194036 ◽

2019 ◽

Vol 9 (19) ◽

pp. 4036 ◽

Cited By ~ 1

Author(s):

You ◽

Wu ◽

Lee ◽

Liu

Keyword(s):

Neural Network ◽

Clustering Algorithm ◽

Classification Problem ◽

Machine Learning Techniques ◽

Training Dataset ◽

Reduction Techniques ◽

Learning Techniques ◽

Benchmark Datasets ◽

Dimensionality Reduction Techniques ◽

Multi Class Classification

Multi-class classification is a very important technique in engineering applications, e.g., mechanical systems, mechanics and design innovations, applied materials in nanotechnologies, etc. A large amount of research is done for single-label classification where objects are associated with a single category. However, in many application domains, an object can belong to two or more categories, and multi-label classification is needed. Traditionally, statistical methods were used; recently, machine learning techniques, in particular neural networks, have been proposed to solve the multi-class classification problem. In this paper, we develop radial basis function (RBF)-based neural network schemes for single-label and multi-label classification, respectively. The number of hidden nodes and the parameters involved with the basis functions are determined automatically by applying an iterative self-constructing clustering algorithm to the given training dataset, and biases and weights are derived optimally by least squares. Dimensionality reduction techniques are adopted and integrated to help reduce the overfitting problem associated with the RBF networks. Experimental results from benchmark datasets are presented to show the effectiveness of the proposed schemes.

Computer-aided prediction and design of IL-6 inducing peptides: IL-6 plays a crucial role in COVID-19

Briefings in Bioinformatics ◽

10.1093/bib/bbaa259 ◽

2020 ◽

Cited By ~ 2

Author(s):

Anjali Dhall ◽

Sumeet Patiyal ◽

Neelam Sharma ◽

Salman Sadullah Usmani ◽

Gajendra P S Raghava

Keyword(s):

Scientific Community ◽

Prediction Models ◽

Vital Role ◽

Machine Learning Techniques ◽

Validation Dataset ◽

Independent Validation ◽

Immune Epitope ◽

Learning Techniques ◽

Wide Range ◽

Immune Epitope Database

Interleukin 6 (IL-6) is a pro-inflammatory cytokine that stimulates acute phase responses, hematopoiesis and specific immune reactions. Recently, it was found that IL-6 plays a vital role in the progression of COVID-19, which is responsible for the high mortality rate. To help the scientific community fight COVID-19, we have developed a method for predicting IL-6 inducing peptides/epitopes. The models were trained and tested on experimentally validated peptides (365 IL-6 inducing and 2991 non-inducing) extracted from the immune epitope database. Initially, 9149 features of each peptide were computed using Pfeature, which were reduced to 186 features using the SVC-L1 technique. These features were ranked based on their classification ability, and the top 10 features were used for developing prediction models. A wide range of machine learning techniques has been deployed to develop models. A Random Forest-based model achieved a maximum AUROC of 0.84 and 0.83 on the training and independent validation datasets, respectively. We have also identified IL-6 inducing peptides in different proteins of SARS-CoV-2, using our best models to design a vaccine against COVID-19. A web server named IL-6Pred and a standalone package have been developed for predicting, designing and screening IL-6 inducing peptides (https://webs.iiitd.edu.in/raghava/il6pred/).
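
For readers unfamiliar with the AUROC metric reported above, the following is a minimal Python sketch of how it can be computed from labels and classifier scores. It is an illustration only, not part of IL-6Pred: the function name `auroc` and the toy labels and scores are hypothetical. It uses the pairwise definition, where AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): two inducing (1) and two non-inducing (0) peptides.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

Because the metric depends only on how positives rank against negatives, it is not distorted by the strong class imbalance in the dataset (365 versus 2991 peptides) in the way plain accuracy would be.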

A real-world demonstration of machine learning generalizability in the detection of intracranial hemorrhage on head computerized tomography

Scientific Reports ◽

10.1038/s41598-021-95533-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hojjat Salehinejad ◽

Jumpei Kitamura ◽

Noah Ditkofsky ◽

Amy Lin ◽

Aditya Bharatha ◽

...

Keyword(s):

Machine Learning ◽

Medical Imaging ◽

Intracranial Hemorrhage ◽

Real World ◽

External Validation ◽

Model Performance ◽

Training Dataset ◽

Validation Dataset ◽

Great Promise ◽

Clinical Environments

Machine learning (ML) holds great promise in transforming healthcare. While published studies have shown the utility of ML models in interpreting medical imaging examinations, these are often evaluated under laboratory settings. The importance of real world evaluation is best illustrated by case studies that have documented successes and failures in the translation of these models into clinical environments. A key prerequisite for the clinical adoption of these technologies is demonstrating generalizable ML model performance under real world circumstances. The purpose of this study was to demonstrate that ML model generalizability is achievable in medical imaging with the detection of intracranial hemorrhage (ICH) on non-contrast computed tomography (CT) scans serving as the use case. An ML model was trained using 21,784 scans from the RSNA Intracranial Hemorrhage CT dataset while generalizability was evaluated using an external validation dataset obtained from our busy trauma and neurosurgical center. This real world external validation dataset consisted of every unenhanced head CT scan (n = 5965) performed in our emergency department in 2019 without exclusion. The model demonstrated an AUC of 98.4%, sensitivity of 98.8%, and specificity of 98.0%, on the test dataset. On external validation, the model demonstrated an AUC of 95.4%, sensitivity of 91.3%, and specificity of 94.1%. Evaluating the ML model using a real world external validation dataset that is temporally and geographically distinct from the training dataset indicates that ML generalizability is achievable in medical imaging applications.
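
The sensitivity and specificity figures quoted above follow from the standard confusion-matrix definitions, sketched below in Python. This is a generic illustration rather than the study's evaluation code: the function names and the eight toy scans are hypothetical, and 1 is assumed to mean "hemorrhage present".

```python
def confusion_counts(truth, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(truth, predicted))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
    }


def sensitivity(counts):
    """Share of truly positive cases that the model flags: TP / (TP + FN)."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


def specificity(counts):
    """Share of truly negative cases that the model clears: TN / (TN + FP)."""
    return counts["tn"] / (counts["tn"] + counts["fp"])


# Toy data (hypothetical): three scans with hemorrhage, five without.
toy_truth = [1, 1, 1, 0, 0, 0, 0, 0]
toy_predicted = [1, 1, 0, 0, 0, 0, 0, 1]
toy_counts = confusion_counts(toy_truth, toy_predicted)
toy_sensitivity = sensitivity(toy_counts)  # 2 of 3 positives detected
toy_specificity = specificity(toy_counts)  # 4 of 5 negatives cleared
```

Computing both quantities separately on the internal test set and on the external validation set is what makes the drop reported in the abstract (for example, sensitivity from 98.8% to 91.3%) visible.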

Machine Learning Generalisation across Different 3D Architectural Heritage

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9060379 ◽

2020 ◽

Vol 9 (6) ◽

pp. 379 ◽

Cited By ~ 4

Author(s):

Eleonora Grilli ◽

Fabio Remondino

Keyword(s):

Machine Learning ◽

Point Cloud ◽

Machine Learning Techniques ◽

Training Dataset ◽

High Complexity ◽

Architectural Heritage ◽

Learning Techniques ◽

Machine Learning Model ◽

Point Cloud Classification

The use of machine learning techniques for point cloud classification has been investigated extensively in the last decade in the geospatial community, while in the cultural heritage field it has only recently started to be explored. The high complexity and heterogeneity of 3D heritage data, the diversity of the possible scenarios, and the different classification purposes that each case study might present make it difficult to realise a large training dataset for learning purposes. An important practical issue that has not yet been explored is the application of a single machine learning model across large and different architectural datasets. This paper tackles this issue by presenting a methodology that successfully generalises a random forest model trained on a specific dataset to unseen scenarios. This is achieved by looking for the best features suitable to identify the classes of interest (e.g., wall, windows, roof and columns).

Automatic Classification of Web Images as UML Static Diagrams Using Machine Learning Techniques

Applied Sciences ◽

10.3390/app10072406 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2406

Author(s):

Valentín Moreno ◽

Gonzalo Génova ◽

Manuela Alejandres ◽

Anabel Fraga

Keyword(s):

Machine Learning ◽

Software Reuse ◽

Unified Modeling Language ◽

Machine Learning Techniques ◽

Training Dataset ◽

Unified Modeling ◽

Learning Techniques ◽

Web Images ◽

Time Required ◽

Automated Software

Our purpose in this research is to develop a method to automatically and efficiently classify web images as Unified Modeling Language (UML) static diagrams, and to produce a computer tool that implements this function. The tool receives a bitmap file (in different formats) as an input and reports whether the image corresponds to a diagram. For pragmatic reasons, we restricted ourselves to the simplest kinds of diagrams that are more useful for automated software reuse: computer-edited 2D representations of static diagrams. The tool does not require that the images are explicitly or implicitly tagged as UML diagrams. The tool extracts graphical characteristics from each image (such as grayscale histogram, color histogram and elementary geometric forms) and uses a combination of rules to classify it. The rules are obtained with machine learning techniques (rule induction) from a sample of 19,000 web images manually classified by experts. In this work, we do not consider the textual contents of the images. Our tool reaches nearly 95% agreement with manually classified instances, improving the effectiveness of related research works. Moreover, using a training dataset 15 times bigger, the time required to process each image and extract its graphical features (0.680 s) is seven times lower.

A highly accurate model for screening prostate cancer using propensity index panel of ten genes

10.1101/2021.03.22.436371 ◽

2021 ◽

Author(s):

Shipra Jain ◽

Kawal Preet Kaur Malhotra ◽

Sumeet Patiyal ◽

Gajendra P.S. Raghava

Keyword(s):

Gene Expression ◽

Prostate Cancer ◽

Single Gene ◽

Specific Antigen ◽

High Accuracy ◽

Machine Learning Techniques ◽

Validation Dataset ◽

New Approach ◽

Learning Techniques ◽

Feature Selection Techniques

Prostate-specific antigen (PSA) is a key biomarker that is commonly used to screen patients for prostate cancer. A significant number of unnecessary biopsies are performed every year due to the poor accuracy of the PSA-based biomarker. In this study, we identified alternative biomarkers based on gene expression that can be used to screen for prostate cancer with high accuracy. All models were trained and tested on gene expression profiles of 500 prostate cancer and 51 normal samples. Numerous feature selection techniques have been used to identify potential biomarkers. These biomarkers have been used to develop various models using different machine learning techniques for predicting samples of prostate cancer. Our logistic regression-based model achieved the highest AUROC, 0.91, with accuracy 82.42% on the validation dataset. We introduced a new approach called propensity index, where the expression of a gene is converted into a propensity. Our propensity-based approach improved the performance of classification models significantly and achieved AUROC 0.99 with accuracy 96.36% on the validation dataset. We also identified and ranked selected genes which can be used to discriminate prostate cancer patients from healthy individuals with high accuracy. It was observed that single-gene biomarkers can only achieve accuracy around 90%. In this study, we obtained the best performance using a panel of 10 genes with a random forest model using the propensity index.

Classical and Deep Learning Paradigms for Detection and Validation of Key Genes of Risky Outcomes of HCV

Algorithms ◽

10.3390/a13030073 ◽

2020 ◽

Vol 13 (3) ◽

pp. 73

Author(s):

Nagwan M. Abdel Samee

Keyword(s):

Hepatic Cirrhosis ◽

Principal Component ◽

Machine Learning Techniques ◽

Classification Algorithms ◽

Second Phase ◽

Selection Methods ◽

Neural Network Approach ◽

Learning Techniques ◽

Key Genes ◽

Two Phases

Hepatitis C virus (HCV) is one of the most dangerous viruses worldwide. It is the foremost cause of hepatic cirrhosis and hepatocellular carcinoma (HCC). Detecting new key genes that play a role in the growth of HCC in HCV patients using machine learning techniques paves the way for producing accurate antivirals. In this work, there are two phases: detecting the up/downregulated genes using classical univariate and multivariate feature selection methods, and validating the retrieved list of genes using in silico classifiers. However, the classification algorithms in the medical domain frequently suffer from a deficiency of training cases. Therefore, a deep neural network approach is proposed here to validate the significance of the retrieved genes in classifying the HCV-infected samples from the disinfected ones. The validation model is based on the artificial generation of new examples from the retrieved genes’ expressions using sparse autoencoders. Subsequently, the generated gene expression data are used to train conventional classifiers. Our results in the first phase yielded a better retrieval of significant genes using Principal Component Analysis (PCA), a multivariate approach. The list of genes retrieved using PCA had a higher number of HCC biomarkers compared to the ones retrieved from the univariate methods. In the second phase, the classification accuracy can reveal the relevance of the extracted key genes in classifying the HCV-infected and disinfected samples.

Predicting the long-term stability of compact multiplanet systems

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2001258117 ◽

2020 ◽

Vol 117 (31) ◽

pp. 18194-18205 ◽

Cited By ~ 2

Author(s):

Daniel Tamayo ◽

Miles Cranmer ◽

Samuel Hadden ◽

Hanno Rein ◽

Peter Battaglia ◽

...

Keyword(s):

Machine Learning Techniques ◽

Training Dataset ◽

Complementary Method ◽

Long Term Stability ◽

Learning Techniques ◽

Resonant Dynamics ◽

The Stability ◽

Long Timescales

We combine analytical understanding of resonant dynamics in two-planet systems with machine-learning techniques to train a model capable of robustly classifying stability in compact multiplanet systems over long timescales of 10^9 orbits. Our Stability of Planetary Orbital Configurations Klassifier (SPOCK) predicts stability using physically motivated summary statistics measured in integrations of the first 10^4 orbits, thus achieving speed-ups of up to 10^5 over full simulations. This computationally opens up the stability-constrained characterization of multiplanet systems. Our model, trained on ∼100,000 three-planet systems sampled at discrete resonances, generalizes both to a sample spanning a continuous period-ratio range, as well as to a large five-planet sample with qualitatively different configurations to our training dataset. Our approach significantly outperforms previous methods based on systems’ angular momentum deficit, chaos indicators, and parametrized fits to numerical integrations. We use SPOCK to constrain the free eccentricities between the inner and outer pairs of planets in the Kepler-431 system of three approximately Earth-sized planets to both be below 0.05. Our stability analysis provides significantly stronger eccentricity constraints than currently achievable through either radial velocity or transit-duration measurements for small planets and within a factor of a few of systems that exhibit transit-timing variations (TTVs). Given that current exoplanet-detection strategies now rarely allow for strong TTV constraints [S. Hadden, T. Barclay, M. J. Payne, M. J. Holman, Astrophys. J. 158, 146 (2019)], SPOCK enables a powerful complementary method for precisely characterizing compact multiplanet systems. We publicly release SPOCK for community use.
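
The quoted speed-up is consistent with the ratio of the two orbit counts. The short calculation below is a back-of-the-envelope check, assuming (this is not stated in the abstract) that integration cost scales linearly with the number of orbits and that the cost of computing the summary statistics and running the classifier is negligible; the symbols N_full and N_short are introduced here for illustration.

```latex
\[
  \text{speed-up} \;\approx\; \frac{N_{\text{full}}}{N_{\text{short}}}
  \;=\; \frac{10^{9}\ \text{orbits}}{10^{4}\ \text{orbits}}
  \;=\; 10^{5}
\]
```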
