Training Set Fuzzification Towards Prediction Improvement

Author(s):  
Eva Volna ◽  
Jaroslav Zacek ◽  
Robert Jarusek
Keyword(s):  
2011 ◽  
Author(s):  
Jeffrey S. Katz ◽  
John F. Magnotti ◽  
Anthony A. Wright

2020 ◽  
Vol 2020 (10) ◽  
pp. 64-1-64-5
Author(s):  
Mustafa I. Jaber ◽  
Christopher W. Szeto ◽  
Bing Song ◽  
Liudmila Beziaeva ◽  
Stephen C. Benz ◽  
...  

In this paper, we propose a patch-based system to classify non-small cell lung cancer (NSCLC) diagnostic whole slide images (WSIs) into two major histopathological subtypes: adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC). Classifying patients accurately is important for prognosis and therapy decisions. The proposed system was trained and tested on 876 subtyped NSCLC gigapixel-resolution diagnostic WSIs from 805 patients – 664 in the training set and 141 in the test set. The algorithm has modules for: 1) auto-generated tumor/non-tumor masking using a trained residual neural network (ResNet34), 2) cell-density map generation (based on color deconvolution, local drain segmentation, and watershed transformation), 3) patch-level feature extraction using a pre-trained ResNet34, 4) a tower of linear SVMs for different cell ranges, and 5) a majority voting module for aggregating subtype predictions in unseen testing WSIs. The proposed system was trained and tested on several WSI magnifications ranging from x4 to x40 with a best ROC AUC of 0.95 and an accuracy of 0.86 in test samples. This fully-automated histopathology subtyping method outperforms similar published state-of-the-art methods for diagnostic WSIs.
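The final aggregation step (module 5) can be illustrated with a minimal sketch. It assumes each patch of a slide has already received a subtype label from the upstream classifiers; the function name `majority_vote` and the sample labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

def majority_vote(patch_labels):
    """Return the most frequent patch-level label for one slide."""
    if not patch_labels:
        raise ValueError("no patch predictions for this slide")
    counts = Counter(patch_labels)
    # most_common(1) yields the label with the highest count; on a tie,
    # the label that was encountered first wins.
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical patch-level predictions for a single whole slide image.
slide_patches = ["LUAD", "LUSC", "LUAD", "LUAD", "LUSC"]
slide_label = majority_vote(slide_patches)
```

A real pipeline would likely need an explicit tie-breaking rule or a confidence-weighted vote; this sketch only shows the counting step.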


2020 ◽  
Author(s):  
Xin Yi See ◽  
Benjamin Reiner ◽  
Xuelan Wen ◽  
T. Alexander Wheeler ◽  
Channing Klein ◽  
...  

Herein, we describe the use of iterative supervised principal component analysis (ISPCA) in de novo catalyst design. The regioselective synthesis of 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C) via Ti-catalyzed formal [2+2+1] cycloaddition of phenyl propyne and azobenzene was targeted as a proof of principle. The initial reaction conditions led to an unselective mixture of all possible pyrrole regioisomers. ISPCA was conducted on a training set of catalysts, and their performance was regressed against the scores from the top three principal components. Component loadings from this PCA space along with k-means clustering were used to inform the design of new test catalysts. The selectivity of a prospective test set was predicted in silico using the ISPCA model, and only optimal candidates were synthesized and tested experimentally. This data-driven predictive-modeling workflow was iterated, and after only three generations the catalytic selectivity was improved from 0.5 (statistical mixture of products) to over 11 (> 90% C) by incorporating 2,6-dimethyl-4-(pyrrolidin-1-yl)pyridine as a ligand. The successful development of a highly selective catalyst without resorting to long, stochastic screening processes demonstrates the inherent power of ISPCA in de novo catalyst design and should motivate the general use of ISPCA in reaction development.
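The two selectivity figures can be related to product shares with a small worked conversion. This sketch assumes that selectivity is the ratio of C to all other regioisomers combined; the abstract does not define it, so the helper `fraction_from_selectivity` is purely illustrative.

```python
def fraction_from_selectivity(selectivity):
    """Convert a ratio target:(all other products) into the target's share."""
    if selectivity < 0:
        raise ValueError("selectivity must be non-negative")
    return selectivity / (1.0 + selectivity)

# Values quoted in the abstract.
initial_share = fraction_from_selectivity(0.5)     # one third of the product mixture
optimized_share = fraction_from_selectivity(11.0)  # 11/12 of the product mixture
```

Under this assumed definition, a selectivity of 11 corresponds to about 91.7% C, which agrees with the "> 90% C" figure.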


2018 ◽  
Author(s):  
Caitlin C. Bannan ◽  
David Mobley ◽  
A. Geoff Skillman

A variety of fields would benefit from accurate pKa predictions, especially drug design, because a change in ionization state can strongly affect a molecule's physicochemical properties. Participants in the recent SAMPL6 blind challenge were asked to submit predictions for microscopic and macroscopic pKas of 24 drug-like small molecules. We recently built a general model for predicting pKas using a Gaussian process regression trained on physical and chemical features of each ionizable group. Our pipeline takes a molecular graph and uses the OpenEye Toolkits to calculate features describing the removal of a proton. These features are fed into a Scikit-learn Gaussian process to predict microscopic pKas, which are then used to analytically determine macroscopic pKas. Our Gaussian process is trained on a set of 2,700 macroscopic pKas from monoprotic and select diprotic molecules. Here, we share our results for microscopic and macroscopic predictions in the SAMPL6 challenge. Overall, we ranked in the middle of the pack compared to other participants, but our fairly good agreement with experiment is still promising considering that the challenge molecules are chemically diverse and often polyprotic while our training set is predominantly monoprotic. A priority for us in building this model was to include an uncertainty estimate, based on the chemistry of the molecule, that would reflect the likely accuracy of our prediction. Our model reports large uncertainties for the molecules that appear to have chemistry outside our domain of applicability, along with good agreement in quantile-quantile plots, indicating it can predict its own accuracy. The challenge highlighted a variety of means to improve our model, including adding more polyprotic molecules to our training set and more carefully considering which functional groups we do or do not identify as ionizable.
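The step from microscopic to macroscopic pKas can be sketched with the textbook relations for a molecule with several alternative sites. This is not the authors' pipeline: the function names are hypothetical, and only the first and last macroscopic dissociations are covered. For the first dissociation the microscopic acidity constants add; for the last, their reciprocals add.

```python
import math

def first_macro_pka(micro_pkas):
    """First macroscopic pKa: K_macro is the sum of the microscopic Ka values
    for losing one proton from the fully protonated form."""
    return -math.log10(sum(10.0 ** -p for p in micro_pkas))

def last_macro_pka(micro_pkas):
    """Last macroscopic pKa: 1/K_macro is the sum of 1/Ka over the
    microstates that lose the final proton."""
    return math.log10(sum(10.0 ** p for p in micro_pkas))

# Illustrative diprotic acid with two equivalent sites (made-up values).
pka1 = first_macro_pka([4.0, 4.0])  # 4.0 - log10(2)
pka2 = last_macro_pka([5.0, 5.0])   # 5.0 + log10(2)
```

With two equivalent sites the macroscopic values are shifted by log10(2) from the microscopic ones, which is the familiar statistical factor.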



2019 ◽  
Author(s):  
Golokesh Santra ◽  
Nitai Sylvetsky ◽  
Gershom Martin

We present a family of minimally empirical double-hybrid DFT functionals parametrized against the very large and diverse GMTKN55 benchmark. The very recently proposed wB97M(2) empirical double hybrid (with 16 empirical parameters) has the lowest WTMAD2 (weighted mean absolute deviation over GMTKN55) ever reported at 2.19 kcal/mol. However, our xrevDSD-PBEP86-D4 functional reaches a statistically equivalent WTMAD2=2.22 kcal/mol, using just a handful of empirical parameters, and the xrevDOD-PBEP86-D4 functional reaches 2.25 kcal/mol with just opposite-spin MP2 correlation, making it amenable to reduced-scaling algorithms. In general, the D4 empirical dispersion correction is clearly superior to D3BJ. If one eschews dispersion corrections of any kind, noDispSD-SCAN offers a viable alternative. Parametrization over the entire GMTKN55 dataset yields substantial improvement over the small training set previously employed in the DSD papers.
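For readers unfamiliar with the metric, WTMAD2 can be sketched as follows, assuming the usual GMTKN55 definition: each subset's mean absolute deviation is scaled by 56.84 kcal/mol divided by that subset's mean absolute reference energy, and the result is averaged with weights equal to the subset sizes. The function name and the toy numbers below are illustrative only and are not data from the paper.

```python
def wtmad2(subsets, mean_abs_energy_all=56.84):
    """Weighted total mean absolute deviation (scheme 2), in kcal/mol.

    `subsets` is a list of (n_systems, mean_abs_ref_energy, mad) tuples,
    one tuple per benchmark subset.
    """
    total_n = sum(n for n, _, _ in subsets)
    weighted = sum(
        n * (mean_abs_energy_all / mean_abs_e) * mad
        for n, mean_abs_e, mad in subsets
    )
    return weighted / total_n

# Toy input: two made-up subsets, NOT values from GMTKN55.
toy = [(10, 56.84, 2.0), (30, 5.684, 0.5)]
toy_score = wtmad2(toy)  # (10*1*2.0 + 30*10*0.5) / 40
```

The scaling gives subsets with small reference energies (for example, noncovalent interactions) more weight per kcal/mol of error than subsets with large reference energies.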


2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Mohammad Haekal ◽  
Henki Bayu Seta ◽  
Mayanda Mega Santoni

To predict the water quality of the Ciliwung River, data from online monitoring were processed using data mining. The monitoring data were first tabulated in Microsoft Excel and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The decision tree method was chosen because it is simpler, easy to understand, and highly accurate. A total of 5,476 water-quality monitoring records for the Ciliwung River were processed. Classification with the decision tree showed that, of these 5,476 records, 1,059 (19.3242%) indicated that the Ciliwung River was Not Polluted and 4,417 (80.6758%) indicated that it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed very high accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia Number 82 of 2001. The results also show that using the WEKA application with the Decision Tree algorithm to process the monitoring data with three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, data mining, Decision Tree algorithm, WEKA application.
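The four test options differ only in which records are used for training and which for scoring. The sketch below illustrates that difference in plain Python rather than WEKA. It uses a one-feature threshold rule (a depth-1 tree) as a stand-in classifier and synthetic data; all names and values are hypothetical and unrelated to the Ciliwung measurements.

```python
import random

def accuracy(threshold, rows):
    """Share of rows whose label matches the rule x >= threshold."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)

def train_stump(rows):
    """Pick the threshold with the best training accuracy (lowest wins ties)."""
    best_acc, best_t = -1.0, None
    for t in sorted({x for x, _ in rows}):
        acc = accuracy(t, rows)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t

def percentage_split(rows, percent=66, seed=0):
    """Shuffle, then use the first `percent` % for training and the rest for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * percent // 100
    return shuffled[:cut], shuffled[cut:]

def cross_validate(rows, folds=10, seed=0):
    """Mean held-out accuracy over `folds` disjoint parts of the data."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    parts = [shuffled[i::folds] for i in range(folds)]
    scores = []
    for i, held_out in enumerate(parts):
        training = [r for j, part in enumerate(parts) if j != i for r in part]
        scores.append(accuracy(train_stump(training), held_out))
    return sum(scores) / folds

# Synthetic records: (feature value, label), label is 1 when the value is >= 5.
rows = [(i % 10, int(i % 10 >= 5)) for i in range(200)]
supplied_test = [(x, int(x >= 5)) for x in (1, 4, 5, 8)]

model = train_stump(rows)
train_part, test_part = percentage_split(rows, percent=66)
results = {
    "use_training_set": accuracy(model, rows),
    "supplied_test_set": accuracy(model, supplied_test),
    "cross_validation_10": cross_validate(rows, folds=10),
    "percentage_split_66": accuracy(train_stump(train_part), test_part),
}
```

Scoring on the training set tends to be optimistic because the model has already seen every record, so the cross-validation and held-out figures are usually the more informative ones.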


2015 ◽  
Vol 25 ◽  
pp. 27-35 ◽  
Author(s):  
Tonya Trubshoe ◽  
Bryan Found

The relative ability of forensic document examiners (FDEs) to provide support for the proposition of individualization or exclusion on the basis of handwriting features was investigated by surveying opinions expressed in case files by one laboratory's FDEs and comparing these data to blind trial test results taken over a five-year period. The survey of FDEs' opinions in reports showed that opinions were skewed towards support for writer individualization over writer exclusion 92% of the time. Since FDEs have historically developed their individualization and exclusion skills primarily on case files, it is proposed that this unbalanced training context may skew their ability to carry out these tasks. To determine one laboratory's capacity to correctly provide both individualization and exclusion evidence, results of blind validation trials were analyzed. For natural writing written and not written by the specimen writer, FDEs were 62 times more inconclusive when providing support for exclusion of the specimen writer when the specimen writer did not author the questioned sample than they were when providing support for individualization when the specimen writer wrote the questioned sample. An intriguing possibility is that, because of the unbalanced training set, government FDEs may acquire skills that are skewed towards individualization over exclusion.


2016 ◽  
Author(s):  
Frederico dos Santos Liporace ◽  
Ricardo José Machado ◽  
Valmir C. Barbosa
