Comparison of Single-Breed and Multi-Breed Training Populations for Infrared Predictions of Novel Phenotypes in Holstein Cows

In general, Fourier-transform infrared (FTIR) predictions are developed using a single-breed population split into a training and a validation set. However, using populations formed of different breeds is an attractive way to design cross-validation scenarios aimed at improving prediction of difficult-to-measure traits in the dairy industry. This study aimed to evaluate the potential of FTIR prediction using a training set combining specialized and dual-purpose dairy breeds to predict phenotypes that differ in biological meaning, variability, and heritability, namely body condition score (BCS), serum β-hydroxybutyrate (BHB), and kappa casein (κ-CN), in the major cattle breed, i.e., Holstein-Friesian. Data were obtained from specialized dairy breeds: Holstein (468 cows) and Brown Swiss (657 cows), and dual-purpose breeds: Simmental (157 cows), Alpine Grey (75 cows), and Rendena (104 cows), giving a total of 1461 cows from 41 multi-breed dairy herds. The FTIR prediction model was developed using a gradient boosting machine (GBM), and predictive ability for the target phenotype in Holstein cows was assessed using different cross-validation (CV) strategies: a within-breed scenario using 10-fold cross-validation, for which the Holstein population was randomly split into 10 folds, one for validation and the remaining nine for training (10-fold_HO); an across-breed scenario (BS_HO), where the Brown Swiss cows were used as the training set and the Holstein cows as the validation set; a specialized multi-breed scenario (BS+HO_10-fold), where the entire Brown Swiss and Holstein populations were combined and then split into 10 folds; and a multi-breed scenario (Multi-breed), where the training set comprised specialized (Holstein and Brown Swiss) and dual-purpose (Simmental, Alpine Grey, and Rendena) dairy cows, combined with nine folds of the Holstein cows. Lastly, a Multi-breed CV2 scenario was implemented, assuming the same number of records as the reference scenario and using the same proportions as the Multi-breed scenario. Within the Holstein breed, FTIR predictions had a predictive ability of 0.63 for BCS, 0.81 for BHB, and 0.80 for κ-CN. Using a single other breed (Brown Swiss) as the training set for prediction in the Holstein population reduced the prediction accuracy by 10% for BCS, 7% for BHB, and 11% for κ-CN. Notably, the combination of Holstein and Brown Swiss cows in the training set increased the predictive ability of the model by 6%, reaching 0.66 for BCS, 0.85 for BHB, and 0.87 for κ-CN. Using multiple specialized and dual-purpose breeds in the training set outperformed the 10-fold_HO (standard) approach, with an increase in predictive ability of 8% for BCS, 7% for BHB, and 10% for κ-CN. When the Multi-breed CV2 scenario was implemented, no improvement was observed. Our findings suggest that FTIR prediction of different phenotypes in the Holstein breed can be improved by including different specialized and dual-purpose breeds in the training population. Our study also shows that predictive ability is enhanced when the size of the training population and the phenotypic variability are increased.
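The Multi-breed scenario described in this abstract can be illustrated with a minimal sketch of how the folds might be built. This is not the authors' code: the function name `multi_breed_folds`, the breed codes, and the use of a fixed random seed are assumptions made for illustration. Only the breed counts come from the abstract. Each validation set is one Holstein fold, and the training set is the other nine Holstein folds plus all cows of the other breeds.

```python
import random

def multi_breed_folds(breeds, target="HO", k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for a multi-breed scenario.

    The target-breed records are split into k folds. Each validation set is
    one target-breed fold; the training set is the other k-1 target-breed
    folds plus every record from the remaining breeds.
    """
    target_idx = [i for i, b in enumerate(breeds) if b == target]
    other_idx = [i for i, b in enumerate(breeds) if b != target]
    rng = random.Random(seed)
    rng.shuffle(target_idx)
    folds = [target_idx[f::k] for f in range(k)]
    for f in range(k):
        valid = folds[f]
        train = other_idx + [i for g in range(k) if g != f for i in folds[g]]
        yield sorted(train), sorted(valid)

# Breed counts are taken from the abstract; the two-letter codes are
# hypothetical labels (HO Holstein, BS Brown Swiss, SI Simmental,
# AG Alpine Grey, RE Rendena).
counts = {"HO": 468, "BS": 657, "SI": 157, "AG": 75, "RE": 104}
breeds = [b for b, n in counts.items() for _ in range(n)]
splits = list(multi_breed_folds(breeds))
```

Dropping the non-target breeds from `train` would give the within-breed 10-fold_HO scenario, so the same splitter can serve both comparisons.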


Preoperative Predicting the WHO/ISUP Nuclear Grade of Clear Cell Renal Cell Carcinoma by Computed Tomography-Based Radiomics Features

Journal of Personalized Medicine ◽

10.3390/jpm11010008 ◽

2020 ◽

Vol 11 (1) ◽

pp. 8

Author(s):

Claudia-Gabriela Moldovanu ◽

Bianca Boca ◽

Andrei Lebovici ◽

Attila Tamas-Szora ◽

Diana Sorina Feier ◽

...

Keyword(s):

Computed Tomography ◽

Renal Cell Carcinoma ◽

Renal Cell ◽

Clear Cell ◽

Predictive Ability ◽

Nuclear Grade ◽

Training Set ◽

Cell Renal Cell Carcinoma ◽

Validation Set ◽

Sensitivity Specificity

Nuclear grade is important for treatment selection and prognosis in patients with clear cell renal cell carcinoma (ccRCC). This study aimed to determine the ability of preoperative four-phase multiphasic multidetector computed tomography (MDCT)-based radiomics features to predict the WHO/ISUP nuclear grade. All 102 patients with histologically confirmed ccRCC were randomly assigned to the training set (n = 62) or the validation set (n = 40). In both datasets, patients were categorized according to the WHO/ISUP grading system into low-grade ccRCC (grades 1 and 2) and high-grade ccRCC (grades 3 and 4). The feature selection process consisted of three steps, including least absolute shrinkage and selection operator (LASSO) regression analysis, and the radiomics scores were developed using 48 radiomics features (10 in the unenhanced phase, 17 in the corticomedullary (CM) phase, 14 in the nephrographic (NP) phase, and 7 in the excretory phase). The radiomics score (Rad-Score) derived from the CM phase achieved the best predictive ability in the training set, with a sensitivity, specificity, and area under the curve (AUC) of 90.91%, 95.00%, and 0.97, respectively. In the validation set, the Rad-Score derived from the NP phase achieved the best predictive ability, with a sensitivity, specificity, and AUC of 72.73%, 85.30%, and 0.84, respectively. We constructed a complex model, adding the radiomics score for each of the phases to the clinicoradiological characteristics, and found significantly better performance in the discrimination of the nuclear grades of ccRCCs in all MDCT phases. The highest AUC of 0.99 (95% CI, 0.92–1.00, p < 0.0001) was demonstrated for the CM phase. Our results showed that the MDCT radiomics features may play a role as potential imaging biomarkers to preoperatively predict the WHO/ISUP grade of ccRCCs.
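For readers less familiar with the metrics reported in this abstract, the following sketch shows how sensitivity and specificity are computed from binary labels. The function name and the toy labels are hypothetical and are not data from the study; here 1 is taken to mean high-grade and 0 low-grade.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of true positives that are detected
    specificity = tn / (tn + fp)  # share of true negatives that are detected
    return sensitivity, specificity

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```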


Evaluating the accuracy of equivalent-source predictions using cross-validation

10.5194/egusphere-egu2020-15729 ◽

2020 ◽

Author(s):

Leonardo Uieda ◽

Santiago Soler

Keyword(s):

Prediction Accuracy ◽

Cross Validation ◽

Point Sources ◽

Magnetic Data ◽

Random Permutations ◽

Training Set ◽

Equivalent Source ◽

Upward Continuation ◽

Reduction To The Pole ◽

Validation Set

We investigate the use of cross-validation (CV) techniques to estimate the accuracy of equivalent-source (also known as equivalent-layer) models for interpolation and processing of potential-field data. Our preliminary results indicate that some common CV algorithms (e.g., random permutations and k-folds) tend to overestimate the accuracy. We have found that blocked CV methods, where the data are split along spatial blocks instead of randomly, provide more conservative and realistic accuracy estimates. Beyond evaluating an equivalent-source model's performance, cross-validation can be used to automatically determine configuration parameters, like source depth and amount of regularization, that maximize prediction accuracy and avoid over-fitting.

Widely used in gravity and magnetic data processing, the equivalent-source technique consists of a linear model (usually point sources) used to predict the observed field at arbitrary locations. Upward-continuation, interpolation, gradient calculations, leveling, and reduction-to-the-pole can be performed simultaneously by using the model to make predictions (i.e., forward modelling). Likewise, the use of linear models to make predictions is the backbone of many machine learning (ML) applications. The predictive performance of ML models is usually evaluated through cross-validation, in which the data are split (usually randomly) into a training set and a validation set. Models are fit on the training set and their predictions are evaluated on the validation set using a goodness-of-fit metric, like the mean square error or the R² coefficient of determination. Many cross-validation methods exist in the literature, varying in how the data are split and how this process is repeated. Prior research from the statistical modelling of ecological data suggests that prediction accuracy is usually overestimated by traditional CV methods when the data are spatially auto-correlated. This issue can be mitigated by splitting the data along spatial blocks rather than randomly. We conducted experiments on synthetic gravity data to investigate the use of traditional and blocked CV methods in equivalent-source interpolation. We found that the overestimation problem also occurs and that more conservative accuracy estimates are obtained when applying blocked versions of random permutations and k-fold. Further studies need to be conducted to generalize these findings to upward-continuation, reduction-to-the-pole, and derivative calculation.

Open-source software implementations of the equivalent-source and blocked cross-validation (in progress) methods are available in the Python libraries Harmonica and Verde, which are part of the Fatiando a Terra project (www.fatiando.org).
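The idea of splitting along spatial blocks can be sketched in a few lines of plain Python. This is a simplified illustration and not the Harmonica or Verde implementation: the names `block_id` and `blocked_kfold`, the square block shape, and the synthetic grid are all assumptions. Points are first grouped by the block they fall in, and whole blocks (not individual points) are then assigned to folds.

```python
import math
import random

def block_id(x, y, block_size):
    """Return the (column, row) index of the square block containing a point."""
    return (math.floor(x / block_size), math.floor(y / block_size))

def blocked_kfold(points, block_size, k=4, seed=0):
    """Yield (train_idx, test_idx) pairs where whole blocks go to one side."""
    blocks = {}
    for i, (x, y) in enumerate(points):
        blocks.setdefault(block_id(x, y, block_size), []).append(i)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)
    for f in range(k):
        test_keys = set(keys[f::k])
        test = [i for key in keys if key in test_keys for i in blocks[key]]
        train = [i for key in keys if key not in test_keys for i in blocks[key]]
        yield sorted(train), sorted(test)

# Synthetic 20 x 20 grid at unit spacing; blocks of size 5 give 16 blocks.
points = [(float(x), float(y)) for x in range(20) for y in range(20)]
folds = list(blocked_kfold(points, block_size=5.0, k=4))
```

Because neighbouring points end up on the same side of the split, a test point rarely has a near-duplicate in the training set, which is the mechanism behind the more conservative accuracy estimates described in the abstract.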


Improving Prediction Accuracy Using Multi-allelic Haplotype Prediction and Training Population Optimization in Wheat

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401165 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2265-2273 ◽

Cited By ~ 1

Author(s):

Ahmad H. Sallam ◽

Emily Conley ◽

Dzianis Prakapenka ◽

Yang Da ◽

James A. Anderson

Keyword(s):

Population Structure ◽

Protein Content ◽

Prediction Accuracy ◽

Cross Validation ◽

Predictive Ability ◽

Training Population ◽

Percentage Points ◽

And Training ◽

Fold Cross Validation ◽

Single Snps

The use of haplotypes may improve the accuracy of genomic prediction over single SNPs because haplotypes can better capture linkage disequilibrium and genomic similarity in different lines and may capture local high-order allelic interactions. Additionally, prediction accuracy could be improved by portraying population structure in the calibration set. A set of 383 advanced lines and cultivars that represent the diversity of the University of Minnesota wheat breeding program was phenotyped for yield, test weight, and protein content and genotyped using the Illumina 90K SNP Assay. Population structure was confirmed using single SNPs. Haplotype blocks of 5, 10, 15, and 20 adjacent markers were constructed for all chromosomes. A multi-allelic haplotype prediction algorithm was implemented and compared with single SNPs using both k-fold cross validation and stratified sampling optimization. After confirming population structure, the stratified sampling improved the predictive ability compared with k-fold cross validation for yield and protein content, but reduced the predictive ability for test weight. In all cases, haplotype predictions outperformed single SNPs. Haplotypes of 15 adjacent markers showed the best improvement in accuracy for all traits; however, this was more pronounced in yield and protein content. The combined use of haplotypes of 15 adjacent markers and training population optimization significantly improved the predictive ability for yield and protein content by 14.3% (four percentage points) and 16.8% (seven percentage points), respectively, compared with using single SNPs and k-fold cross validation. These results emphasize the effectiveness of using haplotypes in genomic selection to increase genetic gain in self-fertilized crops.
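To make the notion of multi-allelic haplotype blocks concrete, the sketch below groups adjacent markers into fixed-size blocks and one-hot encodes each distinct block haplotype as its own allele. The abstract does not describe the authors' algorithm in detail, so this encoding, the function names, and the toy genotypes (inbred lines coded 0/1 per marker) are assumptions for illustration only.

```python
def haplotype_alleles(markers, block_size):
    """Split one line's markers into tuples of adjacent markers."""
    return [tuple(markers[i:i + block_size])
            for i in range(0, len(markers), block_size)]

def encode_lines(lines, block_size):
    """One-hot encode every distinct haplotype allele observed in each block."""
    per_line = [haplotype_alleles(m, block_size) for m in lines]
    n_blocks = len(per_line[0])
    columns = []  # each column is (block index, haplotype allele)
    for b in range(n_blocks):
        for allele in sorted({hap[b] for hap in per_line}):
            columns.append((b, allele))
    design = [[1 if hap[b] == allele else 0 for b, allele in columns]
              for hap in per_line]
    return columns, design

# Three hypothetical inbred lines with six markers each, blocks of three.
lines = [
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
]
columns, design = encode_lines(lines, block_size=3)
```

Each row of `design` has exactly one non-zero entry per block, so a block with more than two observed haplotypes contributes more than two alleles, which is what distinguishes this coding from single biallelic SNPs.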


Metabolomic Spectra For Phenotypic Prediction of Malting Quality In Spring Barley

10.21203/rs.3.rs-1113863/v1 ◽

2021 ◽

Author(s):

Xiangyu Guo ◽

Ahmed Jahoor ◽

Just Jensen ◽

Pernille Sarup

Keyword(s):

Prediction Accuracy ◽

Cross Validation ◽

Spring Barley ◽

Predictive Ability ◽

Malting Quality ◽

Least Squares Regression ◽

Training Population ◽

Prediction Ability ◽

Rate Of Increase ◽

Best Linear Unbiased

The objectives were to investigate prediction of malting quality (MQ) phenotypes in different locations using information from metabolomic spectra, and to compare the prediction ability of different models and different sizes of training population (TP). A total of 2,667 plots of 564 malting spring barley lines from three years and two locations were included. Five MQ traits were measured in wort produced from each individual plot. The metabolomic features (MFs) used were 24,018 NMR intensities measured on each wort sample. The models used in the statistical analyses were a metabolomic best linear unbiased prediction (MBLUP) model and a partial least squares regression (PLSR) model. Predictive abilities within location and across locations were compared using cross-validation methods. The proportion of variance in MQ traits that could be explained by effects of MFs was above 0.9 for all traits. The prediction accuracy increased with increasing TP size, but once the TP size reached 1,000, the rate of increase was negligible. The number of components considered in the PLSR models can affect their performance, and 20 components were optimal. The accuracy for individual plots and line means ranged from 0.722 to 0.865 using leave-one-line-out cross-validation and from 0.517 to 0.817 using leave-one-location-out cross-validation. In conclusion, it is possible to carry out metabolomic prediction of MQ traits using MFs; the prediction accuracy is high, and MBLUP is better than PLSR if the training population is larger than 100. The results have significant implications for practical barley breeding for malting quality.


Impact of the Choice of Cross-Validation Techniques on the Results of Machine Learning-Based Diagnostic Applications

Healthcare Informatics Research ◽

10.4258/hir.2021.27.3.189 ◽

2021 ◽

Vol 27 (3) ◽

pp. 189-199

Author(s):

Ilias Tougui ◽

Abdelilah Jilbab ◽

Jamal El Mhamdi

Keyword(s):

Machine Learning ◽

Clinical Study ◽

Cross Validation ◽

Learning Technologies ◽

Data Availability ◽

Support Vector ◽

Training Set ◽

The Subject ◽

Validation Set ◽

Diagnostic Applications

Objectives: With advances in data availability and computing capabilities, artificial intelligence and machine learning technologies have evolved rapidly in recent years. Researchers have taken advantage of these developments in healthcare informatics and created reliable tools to predict or classify diseases using machine learning-based algorithms. To correctly quantify the performance of those algorithms, the standard approach is to use cross-validation, where the algorithm is trained on a training set, and its performance is measured on a validation set. Both datasets should be subject-independent to simulate the expected behavior of a clinical study. This study compares two cross-validation strategies, the subject-wise and the record-wise techniques; the subject-wise strategy correctly mimics the process of a clinical study, while the record-wise strategy does not. Methods: We started by creating a dataset of smartphone audio recordings of subjects diagnosed with and without Parkinson’s disease. This dataset was then divided into training and holdout sets using subject-wise and the record-wise divisions. The training set was used to measure the performance of two classifiers (support vector machine and random forest) to compare six cross-validation techniques that simulated either the subject-wise process or the record-wise process. The holdout set was used to calculate the true error of the classifiers. Results: The record-wise division and the record-wise cross-validation techniques overestimated the performance of the classifiers and underestimated the classification error. Conclusions: In a diagnostic scenario, the subject-wise technique is the proper way of estimating a model’s performance, and record-wise techniques should be avoided.
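The difference between the two divisions can be sketched as follows. This is an illustrative reading of the abstract and not the authors' code: the function names, the 25% holdout fraction, and the toy data set of eight subjects with five recordings each are assumptions. The helper `shared_subjects` reports which subjects contribute records to both sides of a split.

```python
import random

def record_wise_split(records, test_fraction=0.25, seed=0):
    """Shuffle individual records, ignoring which subject they belong to."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(idx) * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def subject_wise_split(records, test_fraction=0.25, seed=0):
    """Hold out whole subjects so no subject appears on both sides."""
    subjects = sorted({subject for subject, _ in records})
    random.Random(seed).shuffle(subjects)
    n_test = max(1, int(round(len(subjects) * test_fraction)))
    test_subjects = set(subjects[:n_test])
    train = [i for i, (s, _) in enumerate(records) if s not in test_subjects]
    test = [i for i, (s, _) in enumerate(records) if s in test_subjects]
    return train, test

def shared_subjects(records, train, test):
    """Return the subjects that have records in both train and test."""
    return {records[i][0] for i in train} & {records[i][0] for i in test}

# Hypothetical data: 8 subjects with 5 recordings each.
records = [(f"subject_{s}", r) for s in range(8) for r in range(5)]
sw_train, sw_test = subject_wise_split(records)
rw_train, rw_test = record_wise_split(records)
```

With the subject-wise split, `shared_subjects` is always empty. With the record-wise split it will typically be non-empty, which is the leakage that lets a classifier recognise a subject instead of the disease.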


Learning Susceptibility of a Pathogen to Antibiotics Using Data from Similar Pathogens

Methods of Information in Medicine ◽

10.3414/me9226 ◽

2009 ◽

Vol 48 (03) ◽

pp. 242-247 ◽

Cited By ~ 2

Author(s):

S. Andreassen ◽

L. Leibovici ◽

M. Paul ◽

A. Zalounina

Keyword(s):

Maximum Likelihood ◽

Antibiotic Therapy ◽

Cross Validation ◽

Optimal Size ◽

Empirical Antibiotic Therapy ◽

Training Set ◽

Validation Set ◽

Using Data ◽

Selection Of

Summary Objectives: Selection of empirical antibiotic therapy relies on knowledge of the in vitro susceptibilities of potential pathogens to antibiotics. In this paper the limitations of this knowledge are outlined and a method that can reduce some of the problems is developed. Methods: We propose hierarchical Dirichlet learning for estimation of pathogen susceptibilities to antibiotics, using data from a group of similar pathogens in a bacteremia database. Results: A threefold cross-validation showed that maximum likelihood (ML) estimates of susceptibilities based on individual pathogens gave a distance between estimates obtained from the training set and observed frequencies in the validation set of 16.3%. Estimates based on the initial grouping of pathogens gave a distance of 16.7%. Dirichlet learning gave a distance of 15.6%. Inspection of the pathogen groups led to subdivision of three groups, Citrobacter, Other Gram Negatives and Acinetobacter, out of 26 groups. Estimates based on the subdivided groups gave a distance of 15.4% and Dirichlet learning further reduced this to 15.0%. The optimal size of the imaginary sample inherited from the group was 3. Conclusion: Dirichlet learning improved estimates of susceptibilities relative to ML estimators based on individual pathogens and to classical grouped estimators. The initial pathogen grouping was well founded and improvement by subdivision of the groups was only obtained in three groups. Dirichlet learning was robust to these revisions of the grouping, giving improved estimates in both cases, while the group-based estimates only gave improved estimates after the revision of the groups.
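One common way to read "an imaginary sample inherited from the group" is as a pseudo-count prior: the group-level susceptibility acts like a small number of extra observations added to the pathogen's own data. The sketch below implements that reading for a single pathogen-antibiotic pair. It is an assumption about the form of the estimator, not the hierarchical model from the paper; the function name and the example numbers are hypothetical, and only the imaginary sample size of 3 comes from the abstract.

```python
def shrunk_susceptibility(susceptible, tested, group_rate, imaginary_n=3.0):
    """Posterior-mean style estimate that blends a pathogen's own data with
    its group's susceptibility rate, weighted as imaginary_n pseudo-isolates."""
    return (susceptible + imaginary_n * group_rate) / (tested + imaginary_n)

# Hypothetical pathogen: 1 of 2 isolates susceptible, group rate 0.9.
ml_estimate = 1 / 2
dirichlet_estimate = shrunk_susceptibility(1, 2, group_rate=0.9)

# With no isolates at all, the estimate falls back to the group rate.
no_data_estimate = shrunk_susceptibility(0, 0, group_rate=0.9)
```

Under this reading, rarely isolated pathogens borrow most of their estimate from the group, while frequently isolated pathogens are dominated by their own counts.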


MRI-Based Bone Marrow Radiomics Nomogram for Prediction of Overall Survival in Patients With Multiple Myeloma

Frontiers in Oncology ◽

10.3389/fonc.2021.709813 ◽

2021 ◽

Vol 11 ◽

Author(s):

Yang Li ◽

Yang Liu ◽

Ping Yin ◽

Chuanxi Hao ◽

Chao Sun ◽

...

Keyword(s):

Multiple Myeloma ◽

Bone Marrow ◽

Overall Survival ◽

Predictive Ability ◽

Staging System ◽

Training Set ◽

Clinical Model ◽

Significant Difference ◽

Validation Set ◽

Radiomics Signature

Purpose: To develop and validate a radiomics nomogram for predicting overall survival (OS) in multiple myeloma (MM) patients. Material and Methods: A total of 121 MM patients were enrolled and divided into training (n=84) and validation (n=37) sets. The radiomics signature was established from the selected radiomics features from lumbar MRI. The radiomics signature and clinical risk factors were integrated in a multivariate Cox regression model to construct a radiomics nomogram for predicting MM OS. The predictive ability and accuracy of the nomogram were evaluated by the index of concordance (C-index) and calibration curves, and compared with four other models: the clinical model, the radiomics signature model, the Durie-Salmon staging system (D-S), and the International Staging System (ISS). The potential association between the radiomics signature and progression-free survival (PFS) was also explored. Results: The radiomics signature, 1q21 gain, del (17p), and β2-MG≥5.5 mg/L showed significant association with MM OS. The predictive ability of the radiomics nomogram was better than that of the clinical model, the radiomics signature model, the D-S, and the ISS (C-index: 0.793 vs. 0.733 vs. 0.742 vs. 0.554 vs. 0.671 in the training set, and 0.812 vs. 0.799 vs. 0.717 vs. 0.512 vs. 0.761 in the validation set). The radiomics signature lacked the predictive ability for PFS (log-rank P=0.001 in training set and log-rank P=0.103 in validation set), whereas the 1-, 2- and 3-year PFS rates all showed significant difference between the high and low risk groups (P ≤ 0.05). Conclusion: The MRI-based bone marrow radiomics may be an additional useful tool for MM OS prediction.
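The C-index used to compare the models measures how often the patient with the higher predicted risk is the one who dies earlier. A minimal sketch of Harrell's version is shown below, with ties in risk counted as half-concordant. The function name and the four-patient example are hypothetical and are not data from the study.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had the event and a shorter
    follow-up time than patient j. It is concordant when patient i also has
    the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy example: follow-up in months, event flag (1 = death), predicted risk.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the values reported in the abstract.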


Computer-aided drug design of capuramycin analogues as anti-tuberculosis antibiotics by 3D-QSAR and molecular docking

Open Chemistry ◽

10.1515/chem-2017-0039 ◽

2017 ◽

Vol 15 (1) ◽

pp. 299-307

Author(s):

Yuanyuan Jin ◽

Shuai Fan ◽

Guangxin Lv ◽

Haoyi Meng ◽

Zhengyang Sun ◽

...

Keyword(s):

Molecular Docking ◽

Mycobacterium Smegmatis ◽

Predictive Ability ◽

3D Qsar ◽

Docking Studies ◽

Training Set ◽

Molecular Docking Studies ◽

Computer Aided ◽

Pls Analysis ◽

Validation Set

Capuramycin and a few semisynthetic derivatives have shown potential as anti-tuberculosis antibiotics. To understand their mechanism of action and structure-activity relationships, 3D-QSAR and molecular docking studies were performed. A set of 52 capuramycin derivatives was used for the training set and 13 for the validation set. A highly predictive MFA model was obtained with a cross-validated q² of 0.398, and non-cross-validated partial least-squares (PLS) analysis showed a conventional r² of 0.976 and an r²pred of 0.839. The model has an excellent predictive ability. Combining the 3D-QSAR and molecular docking studies, a number of new capuramycin analogs with predicted improved activities were designed. Biological activity tests of one analog showed useful antibiotic activity against Mycobacterium smegmatis MC2 155 and Mycobacterium tuberculosis H37Rv. Computer-aided molecular docking and 3D-QSAR can improve the design of new capuramycin antimycobacterial antibiotics.
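The gap between the cross-validated q² and the conventional r² can be illustrated with a small sketch. It uses a one-descriptor least-squares line in place of the PLS model from the study, and the function names and toy data are assumptions. The conventional r² scores the model on the same compounds it was fitted to, while the leave-one-out q² refits the model without each compound before predicting it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Conventional r^2: residuals come from a model fitted on all data."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def q_squared_loo(xs, ys):
    """Leave-one-out q^2: each point is predicted by a model fitted without it."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot

# Toy descriptor and activity values for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
r2 = r_squared(xs, ys)
q2 = q_squared_loo(xs, ys)
```

For a least-squares model scored this way, q² cannot exceed r², so a large gap between the two is usually read as a sign that the fitted model describes the training compounds better than it predicts unseen ones.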


Prediction of Ciliwung River Water Quality Using the Decision Tree Algorithm

Jurnal Air Indonesia ◽

10.29122/jai.v12i2.4364 ◽

2021 ◽

Vol 12 (2) ◽

Author(s):

Mohammad Haekal ◽

Henki Bayu Seta ◽

Mayanda Mega Santoni

Keyword(s):

Data Mining ◽

Decision Tree ◽

Cross Validation ◽

Online Monitoring ◽

Training Set ◽

Microsoft Excel ◽

Test Set

To predict the water quality of the Ciliwung River, data obtained from online monitoring were processed using a data mining method. In this method, the monitoring data were first arranged in a Microsoft Excel table and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The decision tree method was chosen because it is simpler, easy to understand, and has a very high level of accuracy. A total of 5,476 records of Ciliwung River water quality monitoring data were processed. Classification with the decision tree showed that, of these 5,476 records, 1,059 (19.3242%) indicated that the Ciliwung River was Not Polluted and 4,417 (80.6758%) indicated that it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed a very high level of accuracy, above 99%. From these results, it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia Number 82 of 2001. The results also show that using the WEKA application with the Decision Tree algorithm to process monitoring data based on three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, data mining, decision tree algorithm, WEKA application.


Discriminating Malignancy in Thyroid Nodules: The Nomogram Versus the Kwak and ACR TI-RADS

Otolaryngology ◽

10.1177/0194599820939071 ◽

2020 ◽

Vol 163 (6) ◽

pp. 1156-1165

Author(s):

Juan Xiao ◽

Qiang Xiao ◽

Wei Cong ◽

Ting Li ◽

Shouluan Ding ◽

...

Keyword(s):

Thyroid Nodules ◽

Characteristic Curve ◽

Area Under The Curve ◽

Diagnostic Study ◽

Diagnostic Efficiency ◽

Training Set ◽

Multivariable Logistic Regression Model ◽

Predictive Values ◽

Validation Set ◽

Sensitivity Specificity

Objective: To develop an easy-to-use nomogram for discrimination of malignant thyroid nodules and to compare diagnostic efficiency with the Kwak and American College of Radiology (ACR) Thyroid Imaging, Reporting and Data System (TI-RADS). Study Design: Retrospective diagnostic study. Setting: The Second Hospital of Shandong University. Subjects and Methods: From March 2017 to April 2019, 792 patients with 1940 thyroid nodules were included in the training set; from May 2019 to December 2019, 174 patients with 389 nodules were included in the validation set. A multivariable logistic regression model was used to develop a nomogram for discriminating malignant nodules. To compare the diagnostic performance of the nomogram with the Kwak and ACR TI-RADS, the area under the receiver operating characteristic curve, sensitivity, specificity, and positive and negative predictive values were calculated. Results: The nomogram consisted of 7 factors: composition, orientation, echogenicity, border, margin, extrathyroidal extension, and calcification. In the training set, for all nodules, the area under the curve (AUC) for the nomogram was 0.844, which was higher than the Kwak TI-RADS (0.826, P = .008) and the ACR TI-RADS (0.810, P < .001). For the 822 nodules >1 cm, the AUC of the nomogram was 0.891, which was higher than the Kwak TI-RADS (0.852, P < .001) and the ACR TI-RADS (0.853, P < .001). In the validation set, the AUC of the nomogram was also higher than the Kwak and ACR TI-RADS (P < .05), both in the whole series and separately for nodules >1 or ≤1 cm. Conclusions: When compared with the Kwak and ACR TI-RADS, the nomogram had a better performance in discriminating malignant thyroid nodules.
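The AUC values compared in this abstract can be read as the probability that a randomly chosen malignant nodule receives a higher score than a randomly chosen benign one. The sketch below computes the AUC directly from that definition, with ties counted as one half. The function name and the toy scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = malignant, 0 = benign, with a hypothetical nomogram score.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of nodules, which is acceptable for a sketch; rank-based formulas give the same value more efficiently on large data sets.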
