Training Set Fuzzification Towards Prediction Improvement

Mental Health Interpreter training set

PsycEXTRA Dataset ◽

10.1037/e402172008-012 ◽

2004 ◽

Keyword(s):

Mental Health ◽

Training Set ◽

Interpreter Training


Training Set Size and Response Location Effects on Same/Different Judgments in Humans

PsycEXTRA Dataset ◽

10.1037/e520602012-170 ◽

2011 ◽

Author(s):

Jeffrey S. Katz ◽

John F. Magnotti ◽

Anthony A. Wright

Keyword(s):

Training Set ◽

Response Location ◽

Set Size


Pathology image-based lung cancer subtyping using deep-learning features and cell-density maps

Electronic Imaging ◽

10.2352/issn.2470-1173.2020.10.ipas-064 ◽

2020 ◽

Vol 2020 (10) ◽

pp. 64-1-64-5

Author(s):

Mustafa I. Jaber ◽

Christopher W. Szeto ◽

Bing Song ◽

Liudmila Beziaeva ◽

Stephen C. Benz ◽

...

Keyword(s):

Lung Cancer ◽

Cell Density ◽

Majority Voting ◽

Training Set ◽

Density Maps ◽

Color Deconvolution ◽

Map Generation ◽

Density Map ◽

Pathology Image ◽

Whole Slide Images

In this paper, we propose a patch-based system to classify non-small cell lung cancer (NSCLC) diagnostic whole slide images (WSIs) into two major histopathological subtypes: adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC). Classifying patients accurately is important for prognosis and therapy decisions. The proposed system was trained and tested on 876 subtyped NSCLC gigapixel-resolution diagnostic WSIs from 805 patients – 664 in the training set and 141 in the test set. The algorithm has modules for: 1) auto-generated tumor/non-tumor masking using a trained residual neural network (ResNet34), 2) cell-density map generation (based on color deconvolution, local drain segmentation, and watershed transformation), 3) patch-level feature extraction using a pre-trained ResNet34, 4) a tower of linear SVMs for different cell ranges, and 5) a majority voting module for aggregating subtype predictions in unseen testing WSIs. The proposed system was trained and tested on several WSI magnifications ranging from x4 to x40 with a best ROC AUC of 0.95 and an accuracy of 0.86 in test samples. This fully-automated histopathology subtyping method outperforms similar published state-of-the-art methods for diagnostic WSIs.
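The final aggregation step described above (majority voting over patch-level subtype predictions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the label strings, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(patch_labels):
    """Return the most frequent patch-level label for one slide.

    `patch_labels` is a list of predicted subtype strings, one per patch.
    Ties are broken alphabetically, which is an arbitrary choice made
    for this sketch only.
    """
    if not patch_labels:
        raise ValueError("need at least one patch prediction")
    counts = Counter(patch_labels)
    best = max(counts.values())
    # Collect every label that reaches the top count, then pick one deterministically.
    return min(label for label, count in counts.items() if count == best)


if __name__ == "__main__":
    # Hypothetical patch predictions for a single whole slide image.
    patches = ["LUAD", "LUSC", "LUAD", "LUAD", "LUSC"]
    print(majority_vote(patches))
```

A weighted vote (for example, by classifier confidence) would be a natural variation, but the abstract does not say whether one was used.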


Iterative Supervised Principal Component Analysis-Driven Ligand Design for Regioselective Ti-Catalyzed Pyrrole Synthesis

10.26434/chemrxiv.12284378 ◽

2020 ◽

Author(s):

Xin Yi See ◽

Benjamin Reiner ◽

Xuelan Wen ◽

T. Alexander Wheeler ◽

Channing Klein ◽

...

Keyword(s):

Principal Component Analysis ◽

De Novo ◽

Principal Component ◽

Component Analysis ◽

Catalyst Design ◽

Data Driven ◽

Initial Reaction ◽

Training Set ◽

Reaction Conditions ◽

Component Loadings

Herein, we describe the use of iterative supervised principal component analysis (ISPCA) in de novo catalyst design. The regioselective synthesis of 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C) via Ti-catalyzed formal [2+2+1] cycloaddition of phenyl propyne and azobenzene was targeted as a proof of principle. The initial reaction conditions led to an unselective mixture of all possible pyrrole regioisomers. ISPCA was conducted on a training set of catalysts, and their performance was regressed against the scores from the top three principal components. Component loadings from this PCA space along with k-means clustering were used to inform the design of new test catalysts. The selectivity of a prospective test set was predicted in silico using the ISPCA model, and only optimal candidates were synthesized and tested experimentally. This data-driven predictive-modeling workflow was iterated, and after only three generations the catalytic selectivity was improved from 0.5 (statistical mixture of products) to over 11 (> 90% C) by incorporating 2,6-dimethyl-4-(pyrrolidin-1-yl)pyridine as a ligand. The successful development of a highly selective catalyst without resorting to long, stochastic screening processes demonstrates the inherent power of ISPCA in de novo catalyst design and should motivate the general use of ISPCA in reaction development.


SAMPL6 Challenge Results from pKa Predictions Based on a General Gaussian Process Model

10.26434/chemrxiv.6406505.v2 ◽

2018 ◽

Author(s):

Caitlin C. Bannan ◽

David Mobley ◽

A. Geoff Skillman

Keyword(s):

Gaussian Process ◽

Process Model ◽

Molecular Graph ◽

Gaussian Process Regression ◽

Ionization State ◽

Training Set ◽

Physiochemical Properties ◽

Quantile Plots ◽

Physical And Chemical ◽

Good Agreement

A variety of fields would benefit from accurate pKa predictions, especially drug design, due to the effect a change in ionization state can have on a molecule's physicochemical properties. Participants in the recent SAMPL6 blind challenge were asked to submit predictions for microscopic and macroscopic pKas of 24 drug-like small molecules. We recently built a general model for predicting pKas using a Gaussian process regression trained on physical and chemical features of each ionizable group. Our pipeline takes a molecular graph and uses the OpenEye Toolkits to calculate features describing the removal of a proton. These features are fed into a Scikit-learn Gaussian process to predict microscopic pKas, which are then used to analytically determine macroscopic pKas. Our Gaussian process is trained on a set of 2,700 macroscopic pKas from monoprotic and select diprotic molecules. Here, we share our results for microscopic and macroscopic predictions in the SAMPL6 challenge. Overall, we ranked in the middle of the pack compared to other participants, but our fairly good agreement with experiment is still promising considering the challenge molecules are chemically diverse and often polyprotic while our training set is predominately monoprotic. Of particular importance to us when building this model was to include an uncertainty estimate based on the chemistry of the molecule that would reflect the likely accuracy of our prediction. Our model reports large uncertainties for the molecules that appear to have chemistry outside our domain of applicability, along with good agreement in quantile-quantile plots, indicating it can predict its own accuracy. The challenge highlighted a variety of means to improve our model, including adding more polyprotic molecules to our training set and more carefully considering what functional groups we do or do not identify as ionizable.
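The step from microscopic to macroscopic pKas can be illustrated for the simplest case only: several microscopic deprotonations that all start from the same protonation state. There the macroscopic acid dissociation constant is the sum of the microscopic ones. The sketch below covers that case alone and is not the authors' pipeline; the function name `macro_pka_from_micro` is hypothetical.

```python
import math


def macro_pka_from_micro(micro_pkas):
    """Combine parallel microscopic pKas into one macroscopic pKa.

    Assumes every microscopic pKa describes loss of one proton from the
    SAME parent protonation state (parallel routes), so that
    Ka_macro = sum(Ka_micro_i).
    """
    if not micro_pkas:
        raise ValueError("need at least one microscopic pKa")
    total_ka = sum(10.0 ** (-pka) for pka in micro_pkas)
    return -math.log10(total_ka)


if __name__ == "__main__":
    # Two equivalent sites with microscopic pKa 7.0: the macroscopic pKa
    # is lower by log10(2) because either site can lose the proton.
    print(round(macro_pka_from_micro([7.0, 7.0]), 3))
```

Later macroscopic steps of a polyprotic molecule need the full microstate network and are outside what this sketch handles.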


Minimally Empirical Double Hybrid Functionals Trained Against the GMTKN55 Database: revDSD-PBEP86-D4, revDOD-PBE-D4, and DOD-SCAN-D4

10.26434/chemrxiv.7903388.v2 ◽

2019 ◽

Cited By ~ 1

Author(s):

Golokesh Santra ◽

Nitai Sylvetsky ◽

Gershom Martin

Keyword(s):

Substantial Improvement ◽

Viable Alternative ◽

Mean Absolute Deviation ◽

Dispersion Correction ◽

Training Set ◽

Weighted Mean ◽

Absolute Deviation ◽

Hybrid Functionals ◽

Scaling Algorithms

We present a family of minimally empirical double-hybrid DFT functionals parametrized against the very large and diverse GMTKN55 benchmark. The very recently proposed ωB97M(2) empirical double hybrid (with 16 empirical parameters) has the lowest WTMAD2 (weighted mean absolute deviation over GMTKN55) ever reported at 2.19 kcal/mol. However, our xrevDSD-PBEP86-D4 functional reaches a statistically equivalent WTMAD2=2.22 kcal/mol, using just a handful of empirical parameters, and the xrevDOD-PBEP86-D4 functional reaches 2.25 kcal/mol with just opposite-spin MP2 correlation, making it amenable to reduced-scaling algorithms. In general, the D4 empirical dispersion correction is clearly superior to D3BJ. If one eschews dispersion corrections of any kind, noDispSD-SCAN offers a viable alternative. Parametrization over the entire GMTKN55 dataset yields substantial improvement over the small training set previously employed in the DSD papers.


Minimally Empirical Double Hybrid Functionals Trained Against the GMTKN55 Database: revDSD-PBEP86-D4, revDOD-PBE-D4, and DOD-SCAN-D4

10.26434/chemrxiv.7903388.v1 ◽

2019 ◽

Author(s):

Golokesh Santra ◽

Nitai Sylvetsky ◽

Gershom Martin

Keyword(s):

Substantial Improvement ◽

Viable Alternative ◽

Mean Absolute Deviation ◽

Dispersion Correction ◽

Training Set ◽

Weighted Mean ◽

Absolute Deviation ◽

Hybrid Functionals ◽

Scaling Algorithms

We present a family of minimally empirical double-hybrid DFT functionals parametrized against the very large and diverse GMTKN55 benchmark. The very recently proposed ωB97M(2) empirical double hybrid (with 16 empirical parameters) has the lowest WTMAD2 (weighted mean absolute deviation over GMTKN55) ever reported at 2.19 kcal/mol. However, our xrevDSD-PBEP86-D4 functional reaches a statistically equivalent WTMAD2=2.22 kcal/mol, using just a handful of empirical parameters, and the xrevDOD-PBEP86-D4 functional reaches 2.25 kcal/mol with just opposite-spin MP2 correlation, making it amenable to reduced-scaling algorithms. In general, the D4 empirical dispersion correction is clearly superior to D3BJ. If one eschews dispersion corrections of any kind, noDispSD-SCAN offers a viable alternative. Parametrization over the entire GMTKN55 dataset yields substantial improvement over the small training set previously employed in the DSD papers.


Prediction of Ciliwung River Water Quality Using the Decision Tree Algorithm (PREDIKSI KUALITAS AIR SUNGAI CILIWUNG DENGAN MENGGUNAKAN ALGORITMA POHON KEPUTUSAN)

Jurnal Air Indonesia ◽

10.29122/jai.v12i2.4364 ◽

2021 ◽

Vol 12 (2) ◽

Author(s):

Mohammad Haekal ◽

Henki Bayu Seta ◽

Mayanda Mega Santoni

Keyword(s):

Data Mining ◽

Decision Tree ◽

Cross Validation ◽

Online Monitoring ◽

Training Set ◽

Microsoft Excel ◽

Test Set

To predict the water quality of the Ciliwung River, data collected through online monitoring were processed using a data mining method. In this method, the monitoring data are first tabulated in Microsoft Excel and then processed into a decision tree with the Decision Tree algorithm in the WEKA application. The decision tree method was chosen because it is simpler, easy to understand, and has a very high level of accuracy. A total of 5,476 Ciliwung River water quality monitoring records were processed. In the decision tree classification of these 5,476 records, 1,059 records (19.3242%) indicated that the Ciliwung River was Not Polluted, and 4,417 records (80.6758%) indicated Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed a very high level of accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001, and that using the WEKA application with the Decision Tree algorithm to process monitoring data on three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, data mining, decision tree algorithm, WEKA application.
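Two of the four test options named above, a 66% percentage split and 10-fold cross-validation, come down to partitioning record indices. The sketch below shows one plain way to do that for 5,476 records. It is an illustration under stated assumptions, not WEKA's implementation: WEKA may also shuffle and stratify the data, and the function names `percentage_split` and `kfold_indices` are hypothetical.

```python
def percentage_split(n, train_percent):
    """Split indices 0..n-1 into a leading training part and a trailing test part."""
    n_train = round(n * train_percent / 100)
    indices = list(range(n))
    return indices[:n_train], indices[n_train:]


def kfold_indices(n, k):
    """Return k (train, test) index pairs with contiguous, near-equal test folds."""
    base, extra = divmod(n, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds get one additional record each.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    train, test = percentage_split(5476, 66)
    print(len(train), len(test))
    folds = kfold_indices(5476, 10)
    print([len(test_fold) for _, test_fold in folds])
```

Evaluating on the training set itself ("Use Training Set") tends to give an optimistic accuracy, which is why the held-out options are the more informative ones.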


Investigating the Potential for Training Context Effects to Influence Forensic Document Examiners' Relative Skill at Writer Individualization and Exclusion

Journal of Forensic Document Examination ◽

10.31974/jfde25-27-35 ◽

2015 ◽

Vol 25 ◽

pp. 27-35 ◽

Cited By ~ 1

Author(s):

Tonya Trubshoe ◽

Bryan Found

Keyword(s):

Context Effects ◽

Test Results ◽

Training Context ◽

Training Set ◽

Blind Trial ◽

Relative Ability ◽

Trial Test ◽

Case Files ◽

Document Examiners

The relative ability of forensic document examiners (FDEs) to provide support for the proposition of individualization or exclusion on the basis of handwriting features was investigated by surveying opinions expressed in case files by one laboratory's FDEs and comparing these data to blind trial test results taken over a five-year period. The survey of FDEs' opinions on reports showed that opinions were skewed towards support for writer individualization over writer exclusion 92% of the time. Since historically FDEs develop their skills with respect to individualization/exclusion primarily on case files, it is proposed that this unbalanced training context may skew their abilities to carry out the tasks. To determine one laboratory's capacity to correctly provide both individualization and exclusion evidence, results of blind validation trials were analyzed. For natural writing written and not written by the specimen writer, FDEs were 62 times more inconclusive when providing support for exclusion of the specimen writer when the specimen writer did not author the questioned sample than they were when providing support for individualization when the specimen writer wrote the questioned sample. An intriguing possibility is that because of the unbalanced training set, government FDEs may acquire skills which are skewed towards individualization over exclusion.


Equalization of the Training Set For Backpropagation Networks Applied to Classification Problems

10.21528/cbrn1994-016 ◽

2016 ◽

Author(s):

Frederico dos Santos Liporace ◽

Ricardo José Machado ◽

Valmir C. Barbosa

Keyword(s):

Classification Problems ◽

Training Set
