Genomic Prediction Enhanced Sparse Testing for Multi-environment Trials

2020 ◽  
Vol 10 (8) ◽  
pp. 2725-2739 ◽  
Author(s):  
Diego Jarquin ◽  
Reka Howard ◽  
Jose Crossa ◽  
Yoseph Beyene ◽  
Manje Gowda ◽  
...  

“Sparse testing” refers to reduced multi-environment breeding trials in which not all genotypes of interest are grown in each environment. Using genomic-enabled prediction and a model embracing genotype × environment interaction (GE), the non-observed genotype-in-environment combinations can be predicted. Consequently, the overall costs can be reduced and the testing capacities can be increased. The accuracy of predicting the unobserved data depends on several factors, including (1) how many genotypes overlap between environments, (2) in how many environments each genotype is grown, and (3) which prediction method is used. In this research, we studied the predictive ability obtained when using a fixed number of plots and different sparse testing designs. The considered designs included the extreme cases of (1) no overlap of genotypes between environments, and (2) complete overlap of the genotypes between environments; in the latter case, the prediction set consists entirely of genotypes that have not been tested at all. Moreover, we gradually moved from one extreme to the other by considering (3) intermediate designs with varying numbers of non-overlapping (NO) and overlapping (O) genotypes. The empirical study is built upon two maize hybrid data sets consisting of different genotypes crossed to two testers (T1 and T2); each data set was analyzed separately. For each set, phenotypic records on yield from three environments are available. Three prediction models were implemented: two main-effects models (M1 and M2) and a model (M3) including GE. The results showed that the genome-based model including GE (M3) captured more phenotypic variation than the models without this component. Also, M3 provided higher prediction accuracy than models M1 and M2 for the different allocation scenarios.
Reducing the size of the calibration sets decreased the prediction accuracy under all allocation designs, with M3 being the least affected model; however, with the genome-enabled models (i.e., M2 and M3), predictive ability was recovered as more genotypes were tested across environments. Our results indicate that a substantial part of the testing resources can be saved when genome-based models including GE are used to optimize sparse testing designs.
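The allocation designs described above, from no overlap to complete overlap under a fixed plot budget, can be sketched as a simple assignment routine. This is an illustrative reconstruction, not the authors' design algorithm; the function name and parameters are hypothetical.

```python
import random

def sparse_allocation(genotypes, n_env, n_overlap, plots_per_env, seed=1):
    """Assign genotypes to environments under a fixed plot budget:
    n_overlap genotypes are tested in every environment (overlapping, O),
    and the remaining plots receive environment-specific genotypes (NO)."""
    rng = random.Random(seed)
    pool = list(genotypes)
    rng.shuffle(pool)
    shared = pool[:n_overlap]                # grown in all environments
    unique = pool[n_overlap:]
    k = plots_per_env - n_overlap            # unique genotypes per environment
    return {env: shared + unique[env * k:(env + 1) * k] for env in range(n_env)}

# 100 genotypes, 3 environments, 30 plots each, 10 genotypes grown everywhere
design = sparse_allocation(range(100), n_env=3, n_overlap=10, plots_per_env=30)
```

Setting `n_overlap = plots_per_env` reproduces the complete-overlap extreme (the prediction set is then entirely untested genotypes), while `n_overlap = 0` gives the no-overlap extreme.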

Genetics ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Gustavo de los Campos

Abstract Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and in linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies have tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The methodology we propose is a Sparse Selection Index (SSI) that integrates Selection Index methodology with sparsity-inducing techniques commonly used in high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); G-BLUP (the prediction method most commonly used in plant and animal breeding) appears as the special case obtained when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in ten environments) that the SSI can achieve significant gains in prediction accuracy (between 5 and 10%) relative to G-BLUP.
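For reference, the λ = 0 limit of the index (G-BLUP) can be sketched in a few lines of linear algebra. The toy marker matrix, sample sizes, and shrinkage value below are made up for illustration; the SSI itself additionally applies an L1-type penalty that zeroes out small index weights per prediction individual, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 training lines, 5 prediction lines, 200 SNP markers
n_trn, n_tst, p = 50, 5, 200
X = rng.integers(0, 3, size=(n_trn + n_tst, p)).astype(float)
X -= X.mean(axis=0)                     # center marker codes
G = X @ X.T / p                         # genomic relationship matrix
y = rng.normal(size=n_trn)              # training phenotypes (toy values)

trn = np.arange(n_trn)
tst = np.arange(n_trn, n_trn + n_tst)
shrink = 1.0                            # variance-ratio shrinkage (not the SSI's λ)

# G-BLUP genetic values for the untested lines:
# u_hat = G[tst, trn] @ (G[trn, trn] + shrink * I)^{-1} @ y
K = G[np.ix_(trn, trn)] + shrink * np.eye(n_trn)
u_hat = G[np.ix_(tst, trn)] @ np.linalg.solve(K, y)
print(u_hat.shape)                      # (5,)
```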


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset is to build prediction models (in the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets are often required in the exploratory stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods designed for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as SCM. A method to find suitable values for the parameter is also proposed. Because it has only a single parameter, the proposed clustering method is shown to be orders of magnitude more efficient to use than SCM. The effectiveness of the proposed method is demonstrated on phase-space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied to data extraction are identified, and the proposed method is shown to resolve them.


2010 ◽  
Vol 09 (04) ◽  
pp. 547-573 ◽  
Author(s):  
JOSÉ BORGES ◽  
MARK LEVENE

The problem of predicting the next request during a user's navigation session has been extensively studied. In this context, higher-order Markov models have been widely used to model navigation sessions and to predict the next navigation step, while prediction accuracy has mainly been evaluated with the hit-and-miss score. We claim that this score, although useful, is not sufficient for evaluating next-link prediction models when the aim is to find a sufficient model order, to choose the size of a recommendation set, and to assess the impact of unexpected events on prediction accuracy. Herein, we make use of a variable-length Markov model to compare the usefulness of three alternatives to the hit-and-miss score: the Mean Absolute Error, the Ignorance score, and the Brier score. We present an extensive evaluation of the methods on real data sets and a comprehensive comparison of the scoring methods.
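For concreteness, two of the probabilistic scores can be computed as follows for a single predicted next-link distribution (the three-link example is made up):

```python
import math

def brier(probs, actual):
    """Brier score: squared distance between the forecast distribution
    and the one-hot vector of the observed next link (lower is better)."""
    return sum((p - (1.0 if i == actual else 0.0)) ** 2
               for i, p in enumerate(probs))

def ignorance(probs, actual):
    """Ignorance score: -log2 of the probability the model assigned
    to the observed next link (lower is better)."""
    return -math.log2(probs[actual])

p = [0.5, 0.3, 0.2]        # model's distribution over three candidate links
print(brier(p, 0))         # (0.5-1)^2 + 0.3^2 + 0.2^2 = 0.38
print(ignorance(p, 0))     # -log2(0.5) = 1.0
```

Unlike the hit-and-miss score, both measures penalize a confident wrong forecast more than a hesitant one, which is what makes them informative for choosing a recommendation-set size.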


Author(s):  
UREERAT WATTANACHON ◽  
CHIDCHANOK LURSINSAP

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM, are designed to find clusters based on predefined parameters specified by users. These algorithms may be unsuccessful if the choice of parameters is inappropriate for the data set being clustered. Most of these algorithms work very well for compact and hyper-spherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and then removes the noisy data in the second phase. In the third phase, the normal subclusters are iteratively merged into larger clusters based on inter-cluster and intra-cluster distance criteria. The experimental results show that the SPSM algorithm handles noisy data sets efficiently and clusters data sets of arbitrary shapes and different densities. Several color-image examples demonstrate the versatility of the proposed method, and its results are compared with those reported in the literature for the same images. The computational complexity of the SPSM algorithm is O(N²), where N is the number of data points.


2015 ◽  
Author(s):  
Abelardo Montesinos-Lopez ◽  
Osval Montesinos-Lopez ◽  
Jose Crossa ◽  
Juan Burgueno ◽  
Kent Eskridge ◽  
...  

Genomic tools allow the study of the whole genome and are facilitating the study of genotype-environment combinations and their relationship with the phenotype. However, most genomic prediction models developed so far are appropriate for Gaussian phenotypes. For this reason, appropriate genomic prediction models are needed for count data, since the conventional regression models used on count data with a large sample size (n) and a small number of parameters (p) cannot be used for genomic-enabled prediction where the number of parameters (p) is larger than the sample size (n). Here we propose a Bayesian mixed negative binomial (BMNB) genomic regression model for counts that takes into account genotype by environment (G × E) interaction. We also provide all the full conditional distributions to implement a Gibbs sampler. We evaluated the proposed model using a simulated data set and a real wheat data set from the International Maize and Wheat Improvement Center (CIMMYT) and collaborators. Results indicate that our BMNB model is a viable alternative for analyzing count data.
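As background, one common mean-dispersion parameterization of the negative binomial, with Var(y) = μ + μ²/φ, can be written out directly; this parameterization is an assumption for illustration and may differ from the exact form used in the BMNB model.

```python
import math

def nb_pmf(y, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi,
    so that Var(y) = mu + mu**2 / phi (overdispersed for finite phi).
    Computed on the log scale to avoid overflow for large y."""
    log_p = (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
             + phi * math.log(phi / (phi + mu))
             + y * math.log(mu / (phi + mu)))
    return math.exp(log_p)

# Probabilities sum to one and the mean matches mu
probs = [nb_pmf(y, mu=3.0, phi=2.0) for y in range(200)]
print(sum(probs))                                # ~1.0
print(sum(y * p for y, p in enumerate(probs)))   # ~3.0
```

The extra μ²/φ term in the variance is what makes this family suitable for overdispersed counts, where a Poisson model (variance equal to the mean) would be too restrictive.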


10.29007/rh9l ◽  
2019 ◽  
Author(s):  
Cuauhtémoc López-Martín

Defect density (DD) is a measure of the effectiveness of software processes. DD is defined as the total number of defects divided by the size of the software. Predicting DD is an activity of software project planning. This study analyzes attributes of data sets commonly used for building DD prediction models. The data sets of software projects were selected from the International Software Benchmarking Standards Group (ISBSG) Release 2018. The selection criteria were based on attributes such as type of development, development platform, and programming language generation, as suggested by the ISBSG. Since applying these criteria reduces the size of each resulting data set, it hinders good model generalization. Therefore, in this study, a statistical analysis of the data sets was performed with the objective of determining whether they could be pooled instead of being used as separate data sets. Results showed that there was no difference among the DD of new projects nor among the DD of enhancement projects, but there was a difference between the DD of new and enhancement projects. These results suggest that prediction models can be constructed separately for new projects and enhancement projects, but not by pooling new and enhancement ones.
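The DD definition translates directly into code. The size unit below is an assumption for illustration; ISBSG projects may be sized in function points or lines of code.

```python
def defect_density(total_defects, size):
    """Defect density: total number of defects divided by software size
    (e.g., per thousand lines of code or per function point)."""
    return total_defects / size

print(defect_density(30, 12.0))   # 2.5 defects per unit of size
```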


Sensors ◽  
2021 ◽  
Vol 21 (21) ◽  
pp. 7271
Author(s):  
Jian Zhou ◽  
Jian Wang ◽  
Yang Chen ◽  
Xin Li ◽  
Yong Xie

A water environmental Internet of Things (IoT) system, composed of multiple monitoring points equipped with various water quality IoT devices, makes accurate water quality prediction possible. Within the same water area, water flows and exchanges between multiple monitoring points, resulting in an adjacency effect in the water quality information. However, traditional water quality prediction methods use the water quality information of only one monitoring point, ignoring the information of nearby monitoring points. In this paper, we propose a water quality prediction method based on multi-source transfer learning for a water environmental IoT system, in order to effectively use the water quality information of nearby monitoring points to improve prediction accuracy. First, a water quality prediction framework based on multi-source transfer learning is constructed. Specifically, the common features in the water quality samples of multiple nearby monitoring points and the target monitoring point are extracted and then aligned. Using the aligned features, water quality prediction models based on an echo state network are established at multiple nearby monitoring points with distributed computing, and the outputs of these distributed prediction models are then integrated. Second, the prediction parameters of the multi-source transfer learning are optimized. Specifically, the population deviation is back-propagated over multiple iterations, reducing the feature alignment bias and the model alignment bias to improve prediction accuracy. Finally, the proposed method is applied to an actual water quality dataset from Hong Kong. The experimental results demonstrate that the proposed method can make full use of the water quality information of multiple nearby monitoring points to train several water quality prediction models and reduce prediction bias.


Author(s):  
Hua-ping Jia ◽  
Jun-long Zhao ◽  
Jun-Liu ◽  
Min-Zhang ◽  
Wei-Xi Sun

The stacking algorithm has better generalization ability than other learning algorithms and can flexibly handle different tasks. The base level of this algorithm uses heterogeneous learners (different types of learners), but for each data set in K-fold cross-validation, the learners used are homogeneous (the same type of learner). Because the traditional stacking algorithm neglects the accuracy differences among these learners, an accuracy-difference weighting method is proposed to improve it. In the first layer of the traditional stacking algorithm, the learners are weighted according to their prediction accuracy; that is, the first layer's output on the test set is weighted using weights calculated from the obtained accuracies, and the weighted result is fed to the meta-learner as features. Heart disease is among the diseases with the highest incidence and mortality, and its effective prediction can provide an important basis for assisting diagnosis and improving patients' survival rates. In this article, the improved stacking ensemble algorithm was used to construct a two-layer classifier model to predict heart disease. The experimental results, verified on additional heart disease data sets, show that the algorithm can effectively improve the prediction accuracy of heart disease prediction, and that the stacking algorithm has good generalization performance.
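A minimal sketch of the accuracy-difference weighting step, assuming weights proportional to each base learner's validation accuracy; all names and numbers below are illustrative, not the article's actual learners or results.

```python
def accuracy_weights(accuracies):
    """Normalize base-learner validation accuracies into weights."""
    total = sum(accuracies)
    return [a / total for a in accuracies]

# Hypothetical K-fold validation accuracies of three heterogeneous base learners
accs = [0.80, 0.70, 0.50]
w = accuracy_weights(accs)

# First-layer test-set outputs, weighted before being fed to the meta-learner
base_outputs = [0.9, 0.6, 0.4]   # each learner's P(heart disease) for one patient
features = [round(wi * oi, 4) for wi, oi in zip(w, base_outputs)]
print(features)                  # [0.36, 0.21, 0.1]
```

The effect is that a more accurate base learner contributes more strongly to the meta-learner's input features than a weaker one, instead of all first-layer outputs being treated equally.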


2021 ◽  
Vol 143 (11) ◽  
Author(s):  
Mohsen Faramarzi-Palangar ◽  
Behnam Sedaee ◽  
Mohammad Emami Niri

Abstract The correct definition of rock types plays a critical role in reservoir characterization, simulation, and field development planning. In this study, we use the critical pore size (linf) as an approach for reservoir rock typing. Two linf relations were separately derived based on two permeability prediction models and then merged to derive a generalized linf relation. The proposed rock typing methodology includes two main parts: in the first part, we determine an appropriate constant coefficient, and in the second part, we perform reservoir rock typing based on two different scenarios. The first scenario forms groups of rocks using statistical analysis, and the second scenario forms groups of rocks with similar capillary pressure curves. This approach was applied to three data sets: two data sets were used to determine the constant coefficient, and one data set was used to show the applicability of the linf method in comparison with FZI for rock typing.


2015 ◽  
Vol 5 (4) ◽  
pp. 402-420 ◽  
Author(s):  
Guobing Wu ◽  
Hao Zhang ◽  
Ping Chen

Purpose – In this paper, six forms of the non-linear Taylor rule are applied to compare the fitting and prediction of the response function of China's monetary policy, in an attempt to identify a form of the non-linear Taylor rule that accords with Chinese practice. The paper aims to discuss this issue. Design/methodology/approach – The authors conduct in-sample fitting and out-of-sample prediction of the response function of China's monetary policy by introducing the exchange rate factor and by applying forward-looking, backward-looking and within-quarters non-linear Taylor rules with data from the first quarter of 1994 to the second quarter of 2011, with a view to providing a reference for the formulation and implementation of China's monetary policies. Findings – Analyzing the empirical results, the authors find, first, that after introducing the exchange rate factor, both the implementation effect and the prediction ability of the monetary policies improve. The exchange rate has a relatively greater influence on the effect of monetary policy during low-inflation periods, and its introduction significantly improves the prediction accuracy of the monetary policies. Second, as the implementation effect of monetary policy varies greatly under different macro-backgrounds, the situation should be correctly appraised when formulating and implementing monetary policies. According to the empirical results, the monetary policies have obvious non-linear characteristics and transit smoothly with the change of the inflation rate; on the two sides of an inflation rate of 2.174 percent, the response is asymmetric. Research limitations/implications – Admittedly, the conclusions are reached on the basis of quarterly data and a one-step prediction method. The use of mixed-frequency data, including quarterly and monthly data, would no doubt provide more sample information for studying the relevant issues. The use of a multiple-step prediction method may cause a dynamic change in the models' prediction indicators, which would help choose more appropriate prediction models. That is what the authors will study next. Originality/value – First, by introducing the exchange rate, this paper extends non-linear Taylor rules and tests their applicability and fitting effect in China. Second, it identifies from the data a non-linear Taylor rule that conforms to Chinese practice: multiple forms of non-linear Taylor rules and actual macro data are adopted for fitting in order to find a non-linear Taylor rule that fits Chinese practice. Third, an empirical basis is provided for further improving monetary policy prediction models. As there are few studies so far on the prediction accuracy of non-linear Taylor rules, this paper compares and studies their prediction accuracy using multiple advanced prediction techniques, so as to offer a useful approach for the central bank in predicting and formulating monetary policies.
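To illustrate the kind of smooth-transition rule described above, a logistic-transition Taylor rule can be sketched as follows. Every coefficient below is made up for illustration; only the 2.174 percent threshold comes from the abstract, and the paper's estimated response functions will differ.

```python
import math

def smooth_transition_rate(inflation, output_gap, c=2.174, gamma=5.0):
    """Interest-rate rule that transits smoothly between two regimes as
    inflation crosses the threshold c (logistic transition, speed gamma).
    All response coefficients are illustrative, not estimated values."""
    g = 1.0 / (1.0 + math.exp(-gamma * (inflation - c)))  # 0 = low, 1 = high regime
    low = 2.0 + 1.2 * inflation + 0.5 * output_gap        # low-inflation response
    high = 2.0 + 1.8 * inflation + 0.2 * output_gap       # high-inflation response
    return (1.0 - g) * low + g * high

# Asymmetric responses on the two sides of the threshold
print(smooth_transition_rate(1.0, 0.0))
print(smooth_transition_rate(4.0, 0.0))
```

Because the transition weight g varies continuously with inflation, the implied policy response changes smoothly rather than jumping at the threshold, which is the defining feature of this class of non-linear rules.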

