calibration sets
Recently Published Documents


TOTAL DOCUMENTS: 45 (FIVE YEARS: 6)

H-INDEX: 13 (FIVE YEARS: 2)

2021 ◽  
Author(s):  
Andrea Morger ◽  
Marina Garcia de Lomana ◽  
Ulf Norinder ◽  
Fredrik Svensson ◽  
Johannes Kirchmair ◽  
...  

Abstract: Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated into a conformal prediction (CP) framework, which adds a calibration step to estimate the confidence of the predictions. CP models have the advantage of ensuring a predefined error rate under the assumption that the test and calibration sets are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models was evaluated when they were applied either to newer time-split data or to external data. Specifically, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when they were applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. Restored validity is the first prerequisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without retraining the model proved to be a useful approach to restore the validity of most models.
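The recalibration idea in this abstract can be illustrated with a minimal sketch of class-wise (Mondrian-style) inductive conformal prediction. The abstract does not state the nonconformity measure or the CP variant used, so both are assumptions here, as are all function names and all scores, which are invented for illustration. The point shown is only that swapping the calibration scores changes the prediction sets while the underlying model, and hence the test scores, stay untouched.

```python
from typing import Dict, List, Sequence


def conformal_p_value(cal_scores: Sequence[float], test_score: float) -> float:
    """Share of calibration scores at least as nonconforming as the test score."""
    n_at_least = sum(1 for s in cal_scores if s >= test_score)
    return (n_at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_scores: Dict[str, Sequence[float]],
                   test_scores: Dict[str, float],
                   significance: float) -> List[str]:
    """Keep every label whose class-wise p-value exceeds the significance level."""
    return [label for label, score in test_scores.items()
            if conformal_p_value(cal_scores[label], score) > significance]


# Hypothetical nonconformity scores (assumed: 1 - predicted probability of the
# true class), grouped by true class. These numbers are illustrative only.
original_cal = {
    "active": [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.45],
    "inactive": [0.05, 0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35],
}
# Calibration compounds closer to the holdout data, where the same model is
# less certain, so nonconformity scores are larger.
updated_cal = {
    "active": [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60],
    "inactive": [0.15, 0.25, 0.30, 0.40, 0.45, 0.50, 0.60, 0.65, 0.70],
}
# One holdout compound scored by the unchanged model with P(active) = 0.6.
test_scores = {"active": 0.4, "inactive": 0.6}

before = prediction_set(original_cal, test_scores, significance=0.15)
after = prediction_set(updated_cal, test_scores, significance=0.15)
print(before, after)  # ['active'] ['active', 'inactive']
```

With the updated calibration scores the same compound receives a two-label set instead of a single label. This is the kind of trade-off the abstract describes: validity can be recovered, but more predictions become inconclusive.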


2021 ◽  
Vol 39 ◽  
pp. 103126
Author(s):  
Lucas R. Martindale Johnson ◽  
Jeffrey R. Ferguson ◽  
Kyle P. Freund ◽  
Lee Drake ◽  
Daron Duke
Keyword(s):  
X Ray ◽  

Author(s):  
Hans-Jürgen Auinger ◽  
Christina Lehermeier ◽  
Daniel Gianola ◽  
Manfred Mayer ◽  
Albrecht E. Melchinger ◽  
...  

Key message: Model training on data from all selection cycles yielded the highest prediction accuracy by attenuating specific effects of individual cycles. Expected reliability was a robust predictor of accuracies obtained with different calibration sets.

Abstract: The transition from phenotypic to genome-based selection requires a profound understanding of factors that determine genomic prediction accuracy. We analysed experimental data from a commercial maize breeding programme to investigate if genomic measures can assist in identifying optimal calibration sets for model training. The data set consisted of six contiguous selection cycles comprising testcrosses of 5968 doubled haploid lines genotyped with a minimum of 12,000 SNP markers. We evaluated genomic prediction accuracies in two independent prediction sets in combination with calibration sets differing in sample size and genomic measures (effective sample size, average maximum kinship, expected reliability, number of common polymorphic SNPs and linkage phase similarity). Our results indicate that across selection cycles prediction accuracies were as high as 0.57 for grain dry matter yield and 0.76 for grain dry matter content. Including data from all selection cycles in model training yielded the best results because interactions between calibration and prediction sets as well as the effects of different testers and specific years were attenuated. Among genomic measures, the expected reliability of genomic breeding values was the best predictor of empirical accuracies obtained with different calibration sets. For grain yield, a large difference between expected and empirical reliability was observed in one prediction set. We propose to use this difference as guidance for determining the weight phenotypic data of a given selection cycle should receive in model retraining and for selection when both genomic breeding values and phenotypes are available.


Foods ◽  
2021 ◽  
Vol 10 (2) ◽  
pp. 233
Author(s):  
Julio Nogales-Bueno ◽  
Francisco José Rodríguez-Pulido ◽  
Berta Baca-Bocanegra ◽  
Dolores Pérez-Marin ◽  
Francisco José Heredia ◽  
...  

Developing chemometric models from near-infrared (NIR) spectra requires a calibration set that is representative of the entire population. Consequently, the calibration procedure generally demands substantial resources. For that reason, there is great interest in identifying the most spectrally representative samples within a large population set. In this study, principal component analysis and hierarchical clustering analysis were compared for their ability to provide representative calibration sets. The calibration sets generated were used to control the technological maturity of grapes and the total phenolic compounds of grape skins in red and white cultivars. Finally, the accuracy and precision of the models built with the calibration sets produced by the selection algorithms were compared with each other and with those of models built on the whole set of samples, using an external validation set. Most of the standard errors of prediction (SEP) in external validation obtained from the reduced data sets were not significantly different from those obtained using the whole data set. Moreover, sample subsets resulting from hierarchical clustering analysis appear to produce slightly better results.


2020 ◽  
Vol 496 (3) ◽  
pp. 2962-2997 ◽  
Author(s):  
C Maraston ◽  
L Hill ◽  
D Thomas ◽  
R Yan ◽  
Y Chen ◽  
...  

ABSTRACT: We use the first release of the SDSS/MaStar stellar library, comprising ∼9000 high-S/N spectra, to calculate integrated spectra of stellar population models. The models extend over the wavelength range 0.36–1.03 µm and share the same spectral resolution ($R\sim 1800$) and flux calibration as the SDSS-IV/MaNGA galaxy data. The parameter space covered by the stellar spectra collected thus far allows the calculation of models with ages and chemical composition in the range $t > 200\,\mathrm{Myr}$, $-2 \le [\mathrm{Z/H}] \le +0.35$, which will be extended as MaStar proceeds. Notably, the models include spectra for dwarf main-sequence stars close to the core H-burning limit, as well as spectra for cold, metal-rich giants. Both stellar types are crucial for modelling λ > 0.7 µm absorption spectra. Moreover, a better parameter coverage at low metallicity allows the calculation of models as young as 500 Myr and the full account of the blue horizontal branch phase of old populations. We present models adopting two independent sets of stellar parameters (Teff, log g, [Z/H]). In a novel approach, their reliability is tested ‘on the fly’ using the stellar population models themselves. We perform tests with Milky Way and Magellanic Clouds globular clusters, finding that the new models recover their ages and metallicities remarkably well, with systematics as low as a few per cent for homogeneous calibration sets. We also fit a MaNGA galaxy spectrum, finding residuals of the order of a few per cent, comparable to the state-of-the-art models, but now over a wider wavelength range.


2019 ◽  
Vol 11 (4) ◽  
pp. 450 ◽  
Author(s):  
Yi Liu ◽  
Yaolin Liu ◽  
Yiyun Chen ◽  
Yang Zhang ◽  
Tiezhu Shi ◽  
...  

In constructing models for predicting soil organic matter (SOM) by using visible and near-infrared (vis–NIR) spectroscopy, the selection of representative calibration samples is decisive. Few researchers have studied the inclusion of spectral pretreatments in the sample selection strategy. We collected 108 soil samples and applied six commonly used spectral pretreatments to preprocess soil spectra, namely, Savitzky–Golay (SG) smoothing, first derivative (FD), logarithmic function log(1/R), mean centering (MC), standard normal variate (SNV), and multiplicative scatter correction (MSC). Then, the Kennard–Stone (KS) strategy was used to select calibration samples based on the pretreated spectra, and the size of the calibration set varied from 10 samples to 86 samples (80% of the total samples). These calibration sets were employed to construct partial least squares regression (PLSR) models to predict SOM, and the built models were validated on a set of 21 samples (20% of the total samples). The results showed that 64–78% of the calibration sets selected with pretreatment included achieved significantly better SOM estimation performance. The average improvements in residual predictive deviation (ΔRPD) were 0.06, 0.13, 0.19, and 0.13 for FD, log(1/R), MSC, and SNV, respectively. Thus, we concluded that spectral pretreatment improves the sample selection strategy, and the degree of its influence varies with the size of the calibration set and the type of pretreatment.
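The Kennard–Stone (KS) selection named in this abstract can be sketched as follows. This is a minimal illustration of the commonly described form of the algorithm (seed with the two most distant samples, then repeatedly add the sample farthest from its nearest already-selected neighbour). The Euclidean distance, the function name `kennard_stone`, and the toy one-dimensional "spectra" are assumptions, not details taken from the study.

```python
import math
from typing import List, Sequence


def kennard_stone(samples: Sequence[Sequence[float]], n_select: int) -> List[int]:
    """Return indices of n_select samples chosen by Kennard-Stone selection."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")

    # Pairwise Euclidean distances between all (pretreated) spectra.
    dist = [[math.dist(a, b) for b in samples] for a in samples]

    # Seed the calibration set with the two mutually most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Repeatedly add the candidate whose nearest selected sample is farthest away.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy example: five one-band "spectra" (illustrative values only).
toy_spectra = [[0.0], [1.0], [2.0], [5.0], [10.0]]
print(kennard_stone(toy_spectra, 3))  # [0, 4, 3]
```

Because the distances are computed on whatever representation is passed in, applying a pretreatment such as SNV or MSC before calling the function changes which samples are picked, which is the effect the study examines.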


2018 ◽  
Author(s):  
Thibault Leroy ◽  
Yoann Anselmetti ◽  
Marie-Ka Tilak ◽  
Sèverine Bérard ◽  
Laura Csukonyi ◽  
...  

ABSTRACT: Chromosomal organization is relatively stable among avian species, especially with regard to sex chromosomes. Members of the large Sylvioidea clade, however, have a pair of neo-sex chromosomes that is unique to this clade and originates from a parallel translocation of a region of the ancestral 4A chromosome onto both the W and Z chromosomes. Here, we took advantage of this unusual event to study the early stages of sex chromosome evolution. To do so, we sequenced a female (ZW) of two Sylvioidea species, Zosterops borbonicus and Z. pallidus. Then, we organized the Z. borbonicus scaffolds along chromosomes and annotated genes. Molecular phylogenetic dating under various methods and calibration sets confidently confirmed the recent diversification of the genus Zosterops (1–3.5 million years ago), thus representing one of the most exceptional rates of diversification among vertebrates. We then combined genomic coverage comparisons of five males and seven females with homology to the zebra finch genome (Taeniopygia guttata) to identify sex chromosome scaffolds, as well as the candidate chromosome breakpoints for the two translocation events. We observed reduced levels of within-species diversity in both translocated regions and, as expected, even more so on the neoW chromosome. In order to compare the rates of molecular evolution in genomic regions of the autosomal-to-sex transitions, we then estimated the ratios of non-synonymous to synonymous polymorphisms (πN/πS) and substitutions (dN/dS). Based on both ratios, little or no contrast between autosomal and Z genes was observed, a very different outcome from the higher ratios observed at the neoW genes. In addition, we report significant changes in base composition for translocated regions on the W and Z chromosomes and a large accumulation of transposable elements (TE) on the new W region.
Our results revealed contrasting signals of molecular evolution associated with these autosome-to-sex transitions, with congruent signals of W chromosome degeneration yet surprisingly weak support for a fast-Z effect.


2017 ◽  
Vol 13 (9) ◽  
pp. 1285-1300 ◽  
Author(s):  
Wei Ding ◽  
Qinghai Xu ◽  
Pavel E. Tarasov

Abstract. Human impact is a well-known confounder in pollen-based quantitative climate reconstructions as most terrestrial ecosystems have been artificially affected to varying degrees. In this paper, we use a human-induced pollen dataset (H-set) and a corresponding natural pollen dataset (N-set) to establish pollen–climate calibration sets for temperate eastern China (TEC). The two calibration sets, taking a weighted averaging partial least squares (WA-PLS) approach, are used to reconstruct past climate variables from a fossil record, which is located at the margin of the East Asian summer monsoon in north-central China and covers the late glacial Holocene from 14.7 ka BP (thousands of years before AD 1950). Ordination results suggest that mean annual precipitation (Pann) is the main explanatory variable of both pollen composition and percentage distributions in both datasets. The Pann reconstructions, based on the two calibration sets, demonstrate consistently similar patterns and general trends, suggesting a relatively strong climate impact on the regional vegetation and pollen spectra. However, our results also indicate that the human impact may obscure climate signals derived from fossil pollen assemblages. In a test with modern climate and pollen data, the Pann influence on pollen distribution decreases in the H-set, while the human influence index (HII) rises. Moreover, the relatively strong human impact reduces woody pollen taxa abundances, particularly in the subhumid forested areas. Consequently, this shifts their model-inferred Pann optima to the arid end of the gradient compared to Pann tolerances in the natural dataset and further produces distinct deviations when the total tree pollen percentages are high (i.e. about 40 % for the Gonghai area) in the fossil sequence. 
In summary, the calibration set with human impact used in our experiment can produce a reliable general pattern of past climate, but the human impact on vegetation affects the pollen–climate relationship and biases the pollen-based climate reconstruction. The extent of human-induced bias may be rather small for the entire late glacial and early Holocene interval when we use a reference set called natural. Nevertheless, this potential bias should be kept in mind when conducting quantitative reconstructions, especially for the most recent 2 or 3 millennia.
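The shift of taxon optima towards the arid end described above can be illustrated with a small sketch. Note that it uses plain weighted averaging (WA), the simpler relative of the WA-PLS approach named in the abstract, with no deshrinking step and no PLS components. All function names, taxa, pollen percentages, and precipitation values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def taxon_optima(pollen: Sequence[Sequence[float]],
                 climate: Sequence[float]) -> List[float]:
    """Abundance-weighted mean of the climate variable for each taxon."""
    optima = []
    for k in range(len(pollen[0])):
        total = sum(site[k] for site in pollen)
        if total == 0:
            raise ValueError(f"taxon {k} is absent from every calibration site")
        optima.append(sum(site[k] * x for site, x in zip(pollen, climate)) / total)
    return optima


def reconstruct(assemblage: Sequence[float], optima: Sequence[float]) -> float:
    """Abundance-weighted mean of taxon optima for one fossil sample."""
    return sum(y * u for y, u in zip(assemblage, optima)) / sum(assemblage)


# Hypothetical calibration data: two sites, columns = (tree, herb) pollen.
pann = [800.0, 300.0]                      # mean annual precipitation, mm
natural_set = [[80.0, 20.0], [20.0, 80.0]]
# Same sites, but tree pollen is reduced at the wetter site (human impact).
human_set = [[40.0, 60.0], [20.0, 80.0]]

n_optima = taxon_optima(natural_set, pann)  # tree 700.0, herb 400.0
h_optima = taxon_optima(human_set, pann)    # tree about 633.3, herb about 514.3

fossil_sample = [50.0, 50.0]
print(reconstruct(fossil_sample, n_optima))  # 550.0
print(reconstruct(fossil_sample, h_optima))
```

In this toy case the tree optimum inferred from the human-impacted set is lower (drier) than the one inferred from the natural set, which mirrors the direction of the shift reported in the abstract. The magnitudes carry no meaning.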

