Interspecific Sample Prioritization Can Improve QTL Detection With Tree-Based Predictive Models

2021 ◽  
Vol 12 ◽  
Author(s):  
Min-Gyoung Shin ◽  
Sergey V. Nuzhdin

Due to increasing demand for new advanced crops, considerable efforts have been made to explore the improvement of stress- and disease-resistance traits in cultivars through the study of wild crops. When both wild and interspecific hybrid materials are available, a common approach has been to study the two types of materials separately and simply compare the quantitative trait locus (QTL) regions. However, combining the two types of materials can potentially create a more efficient method of finding predictive QTLs. In this simulation study, we focused on scenarios involving causal marker expression suppressed by trans-regulatory mechanisms, where the otherwise easily lost associated signals benefit the most from combining the two types of data. A probabilistic sampling approach was used to prioritize consistent genotypic-phenotypic patterns across both types of data sets. We chose random forest and gradient boosting to apply the prioritization scheme and found that both facilitated the investigation of predictive causal markers in most of the biological scenarios simulated.

2000 ◽  
Vol 51 (4) ◽  
pp. 515 ◽  
Author(s):  
M. R. Shariflou ◽  
C. Moran ◽  
F. W. Nicholas

The occurrence of the Leu127/Val127 variants of the bovine growth hormone (bGH) gene and their effect on milk-production traits were investigated in Australian Holstein-Friesian cattle. Animals were genotyped for the Leu127/Val127 variants, with RFLP methodology, using PCR and AluI digestion of PCR products (AluI-RFLP). Alleles Leu127 and Val127 occurred with frequencies of 82% and 18%, respectively. The quantitative effect of this polymorphic site on milk-production traits was estimated from lactation data and test-day data. Results from the 2 data sets consistently showed that the Leu127 allele is associated with higher production of milk, fat, and protein and is dominant to Val127. The average effects of the gene substitution are 95 L for milk yield, 7 kg for fat yield, and 3 kg for protein yield per lactation. This locus may be directly responsible for quantitative variation or it may be a marker for a closely linked quantitative trait locus (QTL) for milk-production traits in Australian dairy cattle. In either case, it will be useful as an aid to selection for improvement of milk-production traits. As the Leu127 allele is dominant, selection of AI sires homozygous for the Leu127 allele (Leu127/Leu127) will result in maximum benefit without the need for genotyping cows.
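As a rough illustration of the closing argument, the reported allele frequencies can be converted into expected genotype frequencies. The sketch below assumes Hardy-Weinberg proportions, which the abstract does not state, and the function name is hypothetical.

```python
def genotype_frequencies(p_leu):
    """Expected genotype frequencies under Hardy-Weinberg proportions.

    Hardy-Weinberg equilibrium is an assumption made for illustration;
    the abstract reports only the allele frequencies.
    """
    q_val = 1.0 - p_leu
    return {
        "Leu127/Leu127": p_leu ** 2,
        "Leu127/Val127": 2 * p_leu * q_val,
        "Val127/Val127": q_val ** 2,
    }


# Allele frequency of Leu127 reported in the abstract: 82%.
freqs = genotype_frequencies(0.82)

# With Leu127 dominant, any animal carrying at least one Leu127 allele
# is expected to show the higher-production phenotype.
carriers = freqs["Leu127/Leu127"] + freqs["Leu127/Val127"]

for genotype, freq in freqs.items():
    print(f"{genotype}: {freq:.4f}")
print(f"Carriers of at least one Leu127 allele: {carriers:.4f}")
```

Because every offspring of a Leu127/Leu127 sire inherits one Leu127 allele from him, dominance means the cow's genotype does not change the expected phenotype, which is why genotyping cows becomes unnecessary.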


2019 ◽  
Vol 15 (2) ◽  
pp. 201-214 ◽  
Author(s):  
Mahmoud Elish

Purpose: Effective and efficient software security inspection is crucial, as the existence of vulnerabilities represents severe risks to software users. The purpose of this paper is to empirically evaluate the potential application of Stochastic Gradient Boosting Trees (SGBT) as a novel model for enhanced prediction of vulnerable Web components compared to common, popular and recent machine learning models.

Design/methodology/approach: An empirical study was conducted in which the SGBT and 16 other prediction models were trained, optimized and cross-validated using vulnerability data sets from multiple versions of two open-source Web applications written in PHP. The prediction performance of these models was evaluated and compared based on accuracy, precision, recall and F-measure.

Findings: The results indicate that the SGBT models offer improved prediction over the other 16 models and thus are more effective and reliable in predicting vulnerable Web components.

Originality/value: This paper proposes a novel application of SGBT for enhanced prediction of vulnerable Web components and shows its effectiveness.
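For readers unfamiliar with the four evaluation measures named above, a minimal sketch of how they follow from a binary confusion matrix is given here. The counts are invented for illustration and are not taken from the paper; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "vulnerable" treated as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Illustrative counts only (not results from the study).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy alone can look good when vulnerable components are rare, which is one reason to report precision, recall and the F-measure alongside it.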


2018 ◽  
Vol 11 (11) ◽  
pp. 6203-6230 ◽  
Author(s):  
Simon Ruske ◽  
David O. Topping ◽  
Virginia E. Foot ◽  
Andrew P. Morse ◽  
Martin W. Gallagher

Abstract. Primary biological aerosols, including bacteria, fungal spores and pollen, have important implications for public health and the environment. Such particles may have different concentrations of chemical fluorophores and will respond differently in the presence of ultraviolet light, potentially allowing for different types of biological aerosol to be discriminated. Development of ultraviolet light induced fluorescence (UV-LIF) instruments such as the Wideband Integrated Bioaerosol Sensor (WIBS) has allowed for size, morphology and fluorescence measurements to be collected in real time. However, without studying instrument responses in the laboratory, it is unclear to what extent different types of particles can be discriminated. Collection of laboratory data is vital to validate any approach used to analyse data and to ensure that the available data are utilized as effectively as possible. In this paper a variety of methodologies are tested on a range of particles collected in the laboratory. Hierarchical agglomerative clustering (HAC) has been previously applied to UV-LIF data in a number of studies and is tested alongside other algorithms that could be used to solve the classification problem: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means and gradient boosting. Whilst HAC was able to effectively discriminate between reference narrow-size distribution PSL particles, yielding a classification error of only 1.8 %, similar results were not obtained when testing on laboratory-generated aerosol, where the classification error was found to be between 11.5 % and 24.2 %. Furthermore, there is a large uncertainty in this approach in terms of the data preparation and the cluster index used, and we were unable to attain consistent results across the different sets of laboratory-generated aerosol tested. The lowest classification errors were obtained using gradient boosting, where the misclassification rate was between 4.38 % and 5.42 %. The largest contribution to the error, in the case of the higher misclassification rate, was the pollen samples, where 28.5 % of the samples were incorrectly classified as fungal spores. The technique was robust to changes in data preparation provided a fluorescent threshold was applied to the data. In the event that laboratory training data are unavailable, DBSCAN was found to be a potential alternative to HAC. In the case of one of the data sets, where 22.9 % of the data were left unclassified, we were able to produce three distinct clusters, obtaining a classification error of only 1.42 % on the classified data. These results could not be replicated for the other data set, where 26.8 % of the data were not classified and a classification error of 13.8 % was obtained. This method, like HAC, also appeared to be heavily dependent on data preparation, requiring a different selection of parameters depending on the preparation used. Further analysis will also be required to confirm our selection of the parameters when using this method on ambient data. There is a clear need for the collection of additional laboratory-generated aerosol to improve interpretation of current databases and to aid in the analysis of data collected from an ambient environment. New instruments with a greater resolution are likely to improve on current discrimination between pollen, bacteria and fungal spores, and even between different species; however, the need for extensive laboratory data sets will grow as a result.
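The DBSCAN figures above combine two quantities: the share of particles left unclassified and the error rate among those that were classified. A minimal sketch of that bookkeeping follows. It assumes cluster labels have already been mapped to particle types and that unclassified points carry the label -1 (the convention used by scikit-learn's DBSCAN); the function name and the toy labels are hypothetical.

```python
def error_on_classified(true_labels, predicted_labels, noise_label=-1):
    """Return (unclassified fraction, error rate on the classified points).

    Points whose predicted label equals noise_label are treated as
    unclassified and excluded from the error rate.
    """
    pairs = list(zip(true_labels, predicted_labels))
    kept = [(t, p) for t, p in pairs if p != noise_label]
    unclassified_fraction = 1 - len(kept) / len(pairs)
    if not kept:
        return unclassified_fraction, None
    errors = sum(1 for t, p in kept if t != p)
    return unclassified_fraction, errors / len(kept)


# Toy labels for illustration only (not data from the study).
truth = ["pollen", "pollen", "fungal", "fungal",
         "bacteria", "bacteria", "pollen", "fungal"]
predicted = ["pollen", "fungal", "fungal", -1,
             "bacteria", "bacteria", -1, "fungal"]

unclassified, error_rate = error_on_classified(truth, predicted)
print(f"Unclassified: {unclassified:.1%}, error on classified: {error_rate:.1%}")
```

Reporting both numbers matters: a low error on the classified subset can coincide with a large unclassified share, as in the two data sets described above.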


Author(s):  
Fereshteh Shahoveisi ◽  
Atena Oladzad ◽  
Luis E. del Rio Mendoza ◽  
Seyedali Hosseinirad ◽  
Susan Ruud ◽  
...  

The polyploid nature of canola (Brassica napus) represents a challenge for the accurate identification of single nucleotide polymorphisms (SNPs) and the detection of quantitative trait loci (QTL). In this study, combinations of eight phenotyping scoring systems and six SNP calling and filtering parameters were evaluated for their efficiency in detection of QTL associated with response to Sclerotinia stem rot, caused by Sclerotinia sclerotiorum, in two doubled haploid (DH) canola mapping populations. Most QTL were detected in lesion length, relative areas under the disease progress curve (rAUDPC) for lesion length, and binomial-plant mortality data sets. Binomial data derived from lesion size were less efficient in QTL detection. Including additional phenotypic data sets in the analysis increased the number of significant QTL by 2.3-fold; however, the continuous data sets were more efficient. Between two filtering parameters used to analyze genotyping by sequencing (GBS) data, imputation of missing data increased QTL detection in one population with a high level of missing data but not in the other. Inclusion of segregation-distorted SNPs increased QTL detection but did not impact their R² values significantly. Twelve of the 16 detected QTL were on chromosomes A02 and C01, and the rest were on A07, A09, and C03. Marker A02-7594120, associated with a QTL on chromosome A02, was detected in both populations. Results of this study suggest that the impact of genotypic variant calling and filtering parameters may be population dependent, while deriving additional phenotyping scoring systems such as rAUDPC and binary mortality data sets may improve QTL detection efficiency.


1985 ◽  
Vol 65 (1) ◽  
pp. 109-122 ◽  
Author(s):  
L. M. DWYER ◽  
H. N. HAYHOE

Estimates of monthly soil temperatures under short-grass cover across Canada using a macroclimatic model (Ouellet 1973a) were compared to monthly averages of soil temperatures monitored over winter at Ottawa between November 1959 and April 1981. Although the fit between monthly estimates and Ottawa observations was generally good (R for all months and depths 0.10, 0.20, 0.50, 1.00 and 1.50 m was 0.90), it was noted that midwinter estimates were generally below observed temperatures at all soil depths. Data sets used in the development of the original Ouellet (1973a) multiple regression equations were collected from stations across Canada, many of which have reduced snow cover. It was found that the buffering capability of the snow cover accumulated at Ottawa during the winter months was underestimated by the pertinent partial regression coefficients in these equations. The coefficients were therefore modified for the Ottawa station during the winter months. The resultant regression models were used to estimate soil temperature during the winters of 1981–1982 and 1982–1983. Although the Ottawa-based models included fewer variables because of the smaller data base available from a single site, comparisons of model estimates and observations were good (R = 0.84 and 0.91) and midwinter estimates were not consistently underestimated as they were using the original Ouellet (1973a) model. Reliable monthly estimates of soil temperatures are important since they are a necessary input to more detailed predictive models of daily soil temperatures.

Key words: Regression model, snowcover, stepwise regression, variable selection


2017 ◽  
Vol 13 (S335) ◽  
pp. 58-64 ◽  
Author(s):  
Hebe Cremades

Abstract. Sophisticated instrumentation dedicated to studying and monitoring our Sun’s activity has proliferated in the past few decades, together with the increasing demand for specialized space weather forecasts that address the needs of commercial and government systems. As a result, theoretical and empirical models and techniques of increasing complexity have been developed, aimed at forecasting the occurrence of solar disturbances, their evolution, and their time of arrival at Earth. Here we will review groundbreaking and recent methods to predict the propagation and evolution of coronal mass ejections and their driven shocks. The methods rely on a wealth of data sets provided by ground- and space-based observatories, involving remote-sensing observations of the corona and the heliosphere, as well as detections of radio waves.


Author(s):  
Dileep Kumar G.

Tree-based learning techniques are considered to be among the best and most widely used supervised learning methods. Tree-based methods give predictive models high accuracy, stability, and ease of interpretation. Unlike linear models, they map non-linear relationships well. These methods can be adapted to either kind of problem at hand (classification or regression). Methods like decision trees, random forest, and gradient boosting are widely used in all kinds of machine learning and data science problems. Hence, it is important for every data analyst to learn these algorithms and use them for modeling. This chapter guides the learner through tree-based modeling techniques from scratch.


Author(s):  
Antonia J. Jones ◽  
Dafydd Evans ◽  
Steve Margetts ◽  
Peter J. Durrant

The Gamma Test is a non-linear modelling analysis tool that allows us to quantify the extent to which a numerical input/output data set can be expressed as a smooth relationship. In essence, it allows us to efficiently calculate that part of the variance of the output that cannot be accounted for by the existence of any smooth model based on the inputs, even though this model is unknown. A key aspect of this tool is its speed: the Gamma Test has time complexity O(M log M), where M is the number of datapoints. For data sets consisting of a few thousand points and a reasonable number of attributes, a single run of the Gamma Test typically takes a few seconds. In this chapter we will show how the Gamma Test can be used in the construction of predictive models and classifiers for numerical data. In doing so, we will demonstrate the use of this technique for feature selection, and for the selection of embedding dimension when dealing with a time-series.
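The abstract does not spell out the computation, so the following is only a sketch based on the commonly described form of the Gamma Test: for each of the first p nearest neighbours, average the squared input distances (delta) and half the squared output differences (gamma), then regress gamma on delta and read the intercept as the noise-variance estimate. The function name and the parameter p are hypothetical, and the neighbour search here is brute force, O(M²), for clarity; reaching the O(M log M) cost quoted above would require a spatial index such as a kd-tree.

```python
def gamma_test(xs, ys, p=10):
    """Sketch of the Gamma Test; returns (intercept, slope).

    xs: list of input vectors (tuples of floats), ys: list of float outputs.
    The intercept estimates the variance of the output that no smooth
    model of the inputs can explain. Brute-force neighbour search.
    """
    m = len(xs)
    deltas = [0.0] * p
    gammas = [0.0] * p
    for i in range(m):
        # Squared distances from point i to every other point, nearest first.
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])), j)
            for j in range(m) if j != i
        )
        for k in range(p):
            dist2, j = neighbours[k]
            deltas[k] += dist2 / m
            gammas[k] += (ys[j] - ys[i]) ** 2 / (2 * m)
    # Ordinary least-squares line: gamma = intercept + slope * delta.
    mean_d = sum(deltas) / p
    mean_g = sum(gammas) / p
    sxx = sum((d - mean_d) ** 2 for d in deltas)
    sxy = sum((d - mean_d) * (g - mean_g) for d, g in zip(deltas, gammas))
    slope = sxy / sxx
    return mean_g - slope * mean_d, slope


# Noise-free linear data: the estimated noise variance should be near zero.
inputs = [(float(i),) for i in range(50)]
outputs = [2.0 * x[0] for x in inputs]
noise_estimate, gradient = gamma_test(inputs, outputs, p=5)
print(f"Estimated noise variance: {noise_estimate:.6f}")
```

For feature selection, the same estimate could be recomputed for different subsets of input columns, preferring subsets that yield a smaller intercept.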


2019 ◽  
pp. 215-220
Author(s):  
X. de Badts ◽  
V. Dumas ◽  
N. Jaegli ◽  
L. Ley ◽  
D. Merdinoglu ◽  
...  
