NP-Scout: Machine Learning Approach for the Quantification and Visualization of the Natural Product-Likeness of Small Molecules

Biomolecules ◽  
2019 ◽  
Vol 9 (2) ◽  
pp. 43 ◽  
Author(s):  
Ya Chen ◽  
Conrad Stork ◽  
Steffen Hirte ◽  
Johannes Kirchmair

Natural products (NPs) remain the most prolific resource for the development of small-molecule drugs. Here we report a new machine learning approach that allows the identification of natural products with high accuracy. The method also generates similarity maps, which highlight atoms that contribute significantly to the classification of small molecules as a natural product or synthetic molecule. The method can hence be utilized to (i) identify natural products in large molecular libraries, (ii) quantify the natural product-likeness of small molecules, and (iii) visualize atoms in small molecules that are characteristic of natural products or synthetic molecules. The models are based on random forest classifiers trained on data sets consisting of more than 265,000 to 322,000 natural products and synthetic molecules. Two-dimensional molecular descriptors, MACCS keys and Morgan2 fingerprints were explored. On an independent test set the models reached areas under the receiver operating characteristic curve (AUC) of 0.997 and Matthews correlation coefficients (MCCs) of 0.954 and higher. The method was further tested on data from the Dictionary of Natural Products, ChEMBL and other resources. The best-performing models are accessible as a free web service at http://npscout.zbh.uni-hamburg.de/npscout.
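The Matthews correlation coefficient reported above can be computed from the four cells of a binary confusion matrix. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name and the counts in the usage line are hypothetical and chosen only for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Return the Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for illustration only (not taken from the paper).
print(matthews_corrcoef(tp=90, tn=85, fp=15, fn=10))
```

An MCC of 1 corresponds to perfect agreement, 0 to chance-level prediction, and -1 to complete disagreement, which makes it a useful complement to AUC when the classes are imbalanced.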

2001 ◽  
Vol 27 (4) ◽  
pp. 521-544 ◽  
Author(s):  
Wee Meng Soon ◽  
Hwee Tou Ng ◽  
Daniel Chung Yong Lim

In this paper, we present a learning approach to coreference resolution of noun phrases in unrestricted text. The approach learns from a small, annotated corpus and the task includes resolving not just a certain type of noun phrase (e.g., pronouns) but rather general noun phrases. It also does not restrict the entity types of the noun phrases; that is, coreference is assigned whether they are of “organization,” “person,” or other types. We evaluate our approach on common data sets (namely, the MUC-6 and MUC-7 coreference corpora) and obtain encouraging results, indicating that on the general noun phrase coreference task, the learning approach holds promise and achieves accuracy comparable to that of nonlearning approaches. Our system is the first learning-based system that offers performance comparable to that of state-of-the-art nonlearning systems on these data sets.


2021 ◽  
Author(s):  
Diti Roy ◽  
Md. Ashiq Mahmood ◽  
Tamal Joyti Roy

Heart disease is the most dominant disease, causing a large number of deaths every year. A 2016 WHO report stated that at least 17 million people die of heart disease every year. This number is gradually increasing, and WHO estimated that the death toll will reach 75 million by 2030. Despite modern technology and health care systems, predicting heart disease remains difficult. Because machine learning algorithms can make predictions from available data sets, we used a machine learning approach to predict heart disease. We collected data from the UCI repository. In our study, we used the Random Forest, Zero R, Voted Perceptron, and K star classifiers. We obtained the best result with the Random Forest classifier, with an accuracy of 97.69.


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241239
Author(s):  
Kai On Wong ◽  
Osmar R. Zaïane ◽  
Faith G. Davis ◽  
Yutaka Yasui

Background Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. Methods Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy. Results The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged 68–95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged 63–67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%).
Conclusions The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians with varying performance by specific ethnic categories.
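The abstract above lists the entire name string and its substrings among the name features. One common way to derive such features is to take character n-grams; the sketch below assumes that scheme, and the function name, the n-gram lengths, and the example name are all hypothetical rather than taken from the study.

```python
def name_substrings(name, min_len=2, max_len=3):
    """Return the whole name plus its character n-grams (an assumed feature scheme)."""
    cleaned = name.strip().lower()
    features = [cleaned]  # the entire name string comes first
    for n in range(min_len, max_len + 1):
        # Slide a window of length n across the name.
        for start in range(len(cleaned) - n + 1):
            features.append(cleaned[start:start + n])
    return features


# Hypothetical example name, for illustration only.
print(name_substrings("Smith"))
```

Features of this kind would typically be converted to a sparse bag-of-substrings vector before being passed to a classifier such as regularized logistic regression.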


2020 ◽  
Author(s):  
Mareen Lösing ◽  
Jörg Ebbing ◽  
Wolfgang Szwillus

Improving the understanding of geothermal heat flux in Antarctica is crucial for ice-sheet modelling and glacial isostatic adjustment. It affects the ice rheology and can lead to basal melting, thereby promoting ice flow. Direct measurements are sparse and models inferred from e.g. magnetic or seismological data differ immensely. By Bayesian inversion, we evaluated the uncertainties of some of these models and studied the interdependencies of the thermal parameters. In contrast to previous studies, our method allows the parameters to vary laterally, which leads to a heterogeneous West- and a slightly more homogeneous East Antarctica with overall lower surface heat flux. The Curie isotherm depth and radiogenic heat production have the strongest impact on our results but both parameters have a high uncertainty.

To overcome such shortcomings, we adopt a machine learning approach, more specifically a Gradient Boosted Regression Tree model, in order to find an optimal predictor for locations with sparse measurements. However, this approach largely relies on global data sets, which are notoriously unreliable in Antarctica. Therefore, validity and quality of the data sets is reviewed and discussed. Using regional and more detailed data sets of Antarctica’s Gondwana neighbors might improve the predictions due to their similar tectonic history. The performance of the machine learning algorithm can then be examined by comparing the predictions to the existing measurements. From our study, we expect to get new insights in the geothermal structure of Antarctica, which will help with future studies on the coupling of Solid Earth and Cryosphere.


2020 ◽  
pp. 1-11
Author(s):  
Wicher A. Bokma ◽  
Paul Zhutovsky ◽  
Erik J. Giltay ◽  
Robert A. Schoevers ◽  
Brenda W.J.H. Penninx ◽  
...  

Background Disease trajectories of patients with anxiety disorders are highly diverse and approximately 60% remain chronically ill. The ability to predict disease course in individual patients would enable personalized management of these patients. This study aimed to predict recovery from anxiety disorders within 2 years applying a machine learning approach. Methods In total, 887 patients with anxiety disorders (panic disorder, generalized anxiety disorder, agoraphobia, or social phobia) were selected from a naturalistic cohort study. A wide array of baseline predictors (N = 569) from five domains (clinical, psychological, sociodemographic, biological, lifestyle) were used to predict recovery from anxiety disorders and recovery from all common mental disorders (CMDs: anxiety disorders, major depressive disorder, dysthymia, or alcohol dependency) at 2-year follow-up using random forest classifiers (RFCs). Results At follow-up, 484 patients (54.6%) had recovered from anxiety disorders. RFCs achieved a cross-validated area under the receiver operating characteristic curve (AUC) of 0.67 when using the combination of all predictor domains (sensitivity: 62.0%, specificity: 62.8%) for predicting recovery from anxiety disorders. Classification of recovery from CMDs yielded an AUC of 0.70 (sensitivity: 64.6%, specificity: 62.3%) when using all domains. In both cases, the clinical domain alone provided comparable performances. Feature analysis showed that prediction of recovery from anxiety disorders was primarily driven by anxiety features, whereas recovery from CMDs was primarily driven by depression features. Conclusions The current study showed moderate performance in predicting recovery from anxiety disorders over a 2-year follow-up for individual patients and indicates that anxiety features are most indicative for anxiety improvement and depression features for improvement in general.
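Sensitivity and specificity, as reported in the abstract above, are the fractions of actual positives and actual negatives that a classifier labels correctly. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not reconstructed from the study's data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from binary confusion-matrix counts."""
    # Sensitivity: share of actual positives predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical counts for illustration only.
sens, spec = sensitivity_specificity(tp=3, fn=1, tn=4, fp=4)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Both values depend on the decision threshold applied to the classifier's scores, whereas the AUC summarizes performance across all thresholds.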


PLoS ONE ◽  
2018 ◽  
Vol 13 (9) ◽  
pp. e0204644 ◽  
Author(s):  
Samuel Egieyeh ◽  
James Syce ◽  
Sarel F. Malan ◽  
Alan Christoffels

2021 ◽  
Author(s):  
Nobonita Saha ◽  
Aninda Mohanta ◽  
Jannatun Tuba Jyoti ◽  
Tamal Joyti Roy ◽  
Diti Roy

We collected two data sets. The first consisted of 45 thousand entries and the second of 43. One data set contained food information, such as calorie count, sugar per 100 grams, fat per 100 grams, and so on. The second data set contained obesity rates among people in the USA aged 0 to 80. We wanted to show a relation between sugar intake and obesity rate. Our experiment found significant evidence of a link between obesity and sugar intake. We used a machine learning approach for our experimental analysis.


Author(s):  
Tsehay Admassu Assegie ◽  
Pramod Sekharan Nair

Handwritten digit recognition is an area of machine learning in which a machine is trained to identify handwritten digits. One method of achieving this is a decision tree classification model. Decision tree classification is a machine learning approach that uses predefined labels from known data sets to predict the classes of future data sets whose class labels are unknown. In this paper, we used the standard Kaggle digits dataset to recognize handwritten digits with a decision tree classification approach, and we evaluated the accuracy of the model for each digit from 0 to 9.
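Evaluating a classifier for each digit separately, as described above, amounts to computing the fraction of examples of each true class that were predicted correctly. The sketch below illustrates one way to do that, assuming this per-class definition of accuracy; the function name and the labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's examples predicted correctly}."""
    total = Counter(y_true)
    # Count only the predictions that match the true label.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in sorted(total)}


# Hypothetical true and predicted digit labels, for illustration only.
y_true = [0, 0, 1, 1, 1, 7, 7]
y_pred = [0, 1, 1, 1, 0, 7, 7]
for digit, acc in per_class_accuracy(y_true, y_pred).items():
    print(digit, round(acc, 2))
```

A per-digit breakdown like this can reveal classes that a single overall accuracy figure hides, for example digits that are frequently confused with one another.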

