Simplicity, Model Fit, Complexity and Uncertainty in Spatial Prediction Models Applied Over Time: We Are Quite Sure, Aren’t We?

Author(s):  
Falk Huettmann ◽  
Thomas Gottschalk


Author(s):
Hannah L Combs ◽  
Kate A Wyman-Chick ◽  
Lauren O Erickson ◽  
Michele K York

Abstract Objective Longitudinal assessment of cognitive and emotional functioning in patients with Parkinson’s disease (PD) is helpful in tracking progression of the disease, developing treatment plans, evaluating outcomes, and educating patients and families. Determining whether change over time is meaningful in neurodegenerative conditions, such as PD, can be difficult, as repeated assessment of neuropsychological functioning is affected by factors other than cognitive change. Regression-based prediction formulas are one method by which clinicians and researchers can determine whether an observed change is meaningful. The purpose of the current study was to develop and validate regression-based prediction models of cognitive and emotional test scores for participants with early-stage idiopathic PD and healthy controls (HC) enrolled in the Parkinson’s Progression Markers Initiative (PPMI). Methods Participants with de novo PD and HC were identified retrospectively from the PPMI archival database. Data from baseline testing and 12-month follow-up were utilized in this study. In total, 688 participants were included in the present study (PD: n = 508; HC: n = 185). Subjects from both groups were randomly divided into development (70%) and validation (30%) subsets. Results Early-stage idiopathic PD patients and healthy controls were similar at baseline. Regression-based models were developed for all cognitive and self-report mood measures within both populations. Within the validation subset, the predicted and observed cognitive test scores did not differ significantly, except for semantic fluency. Conclusions The prediction models can serve as useful tools for researchers and clinicians to study clinically meaningful cognitive and mood change over time in PD.
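The abstract does not give the exact model specification, but regression-based prediction formulas of this kind are typically built by regressing the 12-month score on the baseline score (often with demographic covariates) in the development subset, then standardizing the difference between observed and predicted follow-up scores. A minimal sketch of that general approach in Python, assuming hypothetical pandas column names (baseline, followup, age, education) rather than the actual PPMI variable names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_rb_change_model(dev_df, predictors=("baseline", "age", "education")):
    """Fit a regression-based prediction model on the development subset (pandas DataFrame)."""
    X = dev_df[list(predictors)].to_numpy()
    y = dev_df["followup"].to_numpy()
    model = LinearRegression().fit(X, y)
    # Standard error of the estimate, used to standardize observed minus predicted scores
    see = float(np.sqrt(np.mean((y - model.predict(X)) ** 2)))
    return model, see

def rb_change_z(model, see, df, predictors=("baseline", "age", "education")):
    """Standardized change scores; |z| > 1.645 is a common threshold for meaningful change."""
    predicted = model.predict(df[list(predictors)].to_numpy())
    return (df["followup"].to_numpy() - predicted) / see
```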


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Noah DeWitt ◽  
Mohammed Guedira ◽  
Edwin Lauer ◽  
J. Paul Murphy ◽  
David Marshall ◽  
...  

Abstract Background Genetic variation in growth over the course of the season is a major source of grain yield variation in wheat, and for this reason variants controlling heading date and plant height are among the best-characterized in wheat genetics. While the major variants for these traits have been cloned, the importance of these variants in contributing to genetic variation for plant growth over time is not fully understood. Here we develop a biparental population segregating for major variants for both plant height and flowering time to characterize the genetic architecture of the traits and identify additional novel QTL. Results We find that additive genetic variation for both traits is almost entirely associated with major and moderate-effect QTL, including four novel heading date QTL and four novel plant height QTL. FT2 and Vrn-A3 are proposed as candidate genes underlying QTL on chromosomes 3A and 7A, while Rht8 is mapped to chromosome 2D. These mapped QTL also underlie genetic variation in a longitudinal analysis of plant growth over time. The oligogenic architecture of these traits is further demonstrated by the superior trait prediction accuracy of QTL-based prediction models compared with polygenic genomic selection models. Conclusions In a population constructed from two modern wheat cultivars adapted to the southeast U.S., almost all additive genetic variation in plant growth traits is associated with known major variants or novel moderate-effect QTL. Major transgressive segregation was observed in this population despite the similar plant height and heading date characters of the parental lines. This segregation is driven primarily by a small number of mapped QTL rather than by many small-effect, undetected QTL. As most breeding populations in the southeast U.S. segregate for known QTL for these traits, genetic variation in plant height and heading date in these populations likely emerges from similar combinations of major and moderate-effect QTL. More accurate and cost-effective prediction models can therefore be built by targeted genotyping of key SNPs.
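The QTL-based versus genomic-selection comparison can be illustrated with a short sketch: a handful of mapped QTL markers used as fixed effects versus all genome-wide markers in a ridge-type polygenic model. This is a generic illustration under assumed inputs (a -1/0/1 marker matrix and hypothetical QTL column indices), not the authors' analysis pipeline:

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

def compare_prediction(markers, pheno, qtl_idx, alpha=1.0, cv=5):
    """Cross-validated R^2 of a QTL-only model vs. an all-marker ridge (GS-style) model.

    markers: (n_lines, n_markers) array coded -1/0/1; pheno: heading date or height;
    qtl_idx: column indices of the mapped major/moderate-effect QTL (hypothetical).
    """
    qtl_acc = cross_val_score(LinearRegression(), markers[:, qtl_idx], pheno, cv=cv).mean()
    gs_acc = cross_val_score(Ridge(alpha=alpha), markers, pheno, cv=cv).mean()
    return qtl_acc, gs_acc  # higher cross-validated R^2 = better prediction accuracy
```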


2021 ◽  
Vol 12 (6) ◽  
pp. 1-19
Author(s):  
Yifan He ◽  
Zhao Li ◽  
Lei Fu ◽  
Anhui Wang ◽  
Peng Zhang ◽  
...  

In the emerging business of food delivery, rider traffic accidents raise financial costs and add to the traffic burden on society. Although there has been much effort on traffic accident forecasting using spatial-temporal prediction models, no existing work studies the problem of detecting takeaway rider accidents from food delivery trajectory data. In this article, we aim to detect whether a takeaway rider has an accident during a given time period based on food delivery trajectories and riders’ contextual information. The food delivery data have a heterogeneous information structure and carry contextual information such as weather and delivery history, while trajectory data are collected as a spatial-temporal sequence. We propose a TakeAway Rider Accident detection fusion network, TARA-Net, to jointly model these heterogeneous and spatial-temporal sequence data. We utilize a residual network to extract basic contextual features and a transformer encoder to capture trajectory features. These embedding features are concatenated and fed into a pyramidal feed-forward neural network. We jointly train the three components to combine the benefits of spatial-temporal trajectory data and sparse contextual data for early detection of traffic accidents. Furthermore, because traffic accidents rarely happen in food delivery, we propose a sampling mechanism to alleviate sample imbalance when training the model. We evaluate the model on the transportation mode classification dataset Geolife and a real-world Ele.me dataset covering over 3 million riders. The experimental results show that the proposed model is superior to the state-of-the-art.
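The abstract describes the fusion architecture only at a high level. A minimal PyTorch sketch of the same idea (a residual MLP for contextual features, a transformer encoder over the trajectory sequence, and a pyramidal feed-forward head on the concatenated embeddings) might look like the following; layer sizes, feature dimensions, and names are assumptions, not the published TARA-Net configuration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return torch.relu(x + self.net(x))  # residual connection

class AccidentFusionNet(nn.Module):
    """Sketch of a TARA-Net-style fusion model (sizes are illustrative only)."""

    def __init__(self, ctx_dim=32, traj_dim=4, d_model=64):
        super().__init__()
        self.ctx_encoder = nn.Sequential(nn.Linear(ctx_dim, d_model), ResidualBlock(d_model))
        self.traj_proj = nn.Linear(traj_dim, d_model)  # e.g. lon, lat, speed, time delta
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.traj_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # "Pyramidal" head: progressively narrower fully connected layers
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, ctx, traj):
        ctx_emb = self.ctx_encoder(ctx)                                  # (batch, d_model)
        traj_emb = self.traj_encoder(self.traj_proj(traj)).mean(dim=1)   # pool over time steps
        return self.head(torch.cat([ctx_emb, traj_emb], dim=-1))        # accident logit
```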


2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as long-term course, organ dysfunction or ICU mortality. The number of training datasets used has increased significantly over time. However, these data now come from different waves of the pandemic, which did not always involve the same therapeutic approaches and differed in outcomes between waves. The impact of these changes on model development has not yet been studied. OBJECTIVE The aim of the investigation was to examine the predictive performance of several models trained with data from one wave when predicting the other wave's data, and the impact of pooling these datasets. Finally, a method for comparing different datasets with respect to heterogeneity is introduced. METHODS We used two datasets, one from each of the first two waves, to develop several predictive models for patient mortality. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF) and AdaBoost classifier (ADA). We also performed mutual prediction, applying each model to the data of the wave not used for training. We then compared the performance of models trained on a pooled dataset from both waves. The populations from the different waves were checked for heterogeneity using a convex hull analysis. RESULTS 63 patients from wave one (03-06/2020) and 54 from wave two (08/2020-01/2021) were evaluated. For both waves separately, we found models reaching sufficient accuracies, up to 0.79 AUROC (95% CI 0.76-0.81) for SVM on the first wave and up to 0.88 AUROC (95% CI 0.86-0.89) for RF on the second wave. After pooling the data, the AUROC decreased markedly. In the mutual prediction, models trained on the second wave's data showed, when applied to the first wave's data, good prediction for non-survivors but insufficient classification of survivors. The opposite setup (training: first wave, test: second wave) showed the inverse behaviour, with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis showed a more inhomogeneous distribution of the underlying data for the first and second wave populations than for randomly selected sets of patients of the same size. CONCLUSIONS Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can lead to serious problems. The convex hull analysis offers a solution to this problem: its outcome can raise concerns when pooling different datasets would create inhomogeneous patterns that prevent better predictive performance.
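The abstract does not detail the convex hull procedure. One plausible way to operationalize it is to build a hull (via a Delaunay triangulation) over one wave's patients in a reduced feature space, measure how many of the other wave's patients fall inside it, and benchmark that fraction against random subsets of the pooled data. A minimal sketch under those assumptions, using scipy and scikit-learn:

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.decomposition import PCA

def fraction_inside_hull(reference, query, n_components=3, seed=0):
    """Share of `query` patients lying inside the convex hull of `reference` patients.

    Both arguments are (n_patients, n_features) arrays; PCA reduces dimensionality
    so the hull stays well defined with only a few dozen patients.
    """
    pca = PCA(n_components=n_components, random_state=seed).fit(reference)
    hull = Delaunay(pca.transform(reference))
    # find_simplex returns -1 for points outside the triangulated hull
    return float(np.mean(hull.find_simplex(pca.transform(query)) >= 0))

# Comparing wave-to-wave coverage against coverage between random subsets of the pooled
# data (same sizes) indicates whether the two waves are more inhomogeneous than chance.
```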


2019 ◽  
Vol 57 (3) ◽  
pp. 271-286 ◽  
Author(s):  
Gema Casal ◽  
Paul Harris ◽  
Xavier Monteys ◽  
John Hedley ◽  
Conor Cahalane ◽  
...  

2020 ◽  
Author(s):  
Hanna Meyer ◽  
Edzer Pebesma

Spatial mapping is an important task in environmental science to reveal spatial patterns and changes of the environment. In this context, predictive modelling using flexible machine learning algorithms has become very popular. However, looking at the diversity of modelled (global) maps of environmental variables, one might increasingly get the impression that machine learning is a magic tool to map everything. Recently, the reliability of such maps has been increasingly questioned, calling for a reliable quantification of uncertainties.

Though spatial (cross-)validation gives a general error estimate for the predictions, models are usually applied to make predictions for a much larger area, or may even be transferred to make predictions for an area they were not trained on. When predictions are made over heterogeneous landscapes, there will be areas featuring environmental properties that were not observed in the training data and hence were not learned by the algorithm. This is problematic, as most machine learning algorithms are weak at extrapolation and can only make reliable predictions for environments with conditions the model has knowledge about. Hence, predictions for environmental conditions that differ significantly from the training data have to be considered uncertain.

To approach this problem, we suggest a measure of uncertainty that allows identifying locations where predictions should be regarded with care. The proposed uncertainty measure is based on distances to the training data in the multidimensional predictor variable space. However, distances are not equally relevant within the feature space: some variables are more important than others in the machine learning model and hence are mainly responsible for prediction patterns. Therefore, we weight the distances by the model-derived importance of the predictors.

As a case study we use a simulated area-wide response variable for Europe, bio-climatic variables as predictors, as well as simulated field samples. Random Forest is applied to predict the simulated response. The model is then used to make predictions for the whole of Europe. We then calculate the corresponding uncertainty and compare it to the area-wide true prediction error. The results show that the uncertainty map reflects the patterns in the true error very well and considerably outperforms ensemble-based standard deviations of predictions as an indicator of uncertainty.

The resulting map of uncertainty gives valuable insights into spatial patterns of prediction uncertainty, which is important when the predictions are used as a baseline for decision making or subsequent environmental modelling. Hence, we suggest that a map of distance-based uncertainty be provided in addition to prediction maps.
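A minimal sketch of the core idea follows: distances from prediction locations to the training data in standardized predictor space, weighted by model-derived variable importance. The scikit-learn random forest importances and the brute-force distance computation are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def weighted_min_distance(train_X, pred_X, importances):
    """Importance-weighted Euclidean distance from each prediction location
    to its nearest training sample in standardized predictor space."""
    mean, sd = train_X.mean(axis=0), train_X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)                   # guard against constant predictors
    train_s = (train_X - mean) / sd * importances     # standardize, then weight by importance
    pred_s = (pred_X - mean) / sd * importances
    # Pairwise distances; for large prediction grids this would be chunked or use a KD-tree
    d = np.sqrt(((pred_s[:, None, :] - train_s[None, :, :]) ** 2).sum(axis=-1))
    return d.min(axis=1)  # large values = far from training data = uncertain prediction

# e.g. importances = rf.feature_importances_ from a fitted RandomForestRegressor; the raw
# distances could further be normalized by the average distance among training samples.
```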


2018 ◽  
Vol 57 (1) ◽  
pp. 51-79 ◽  
Author(s):  
D. W. Wanik ◽  
E. N. Anagnostou ◽  
M. Astitha ◽  
B. M. Hartman ◽  
G. M. Lackmann ◽  
...  

Abstract Hurricane Sandy (2012, referred to as Current Sandy) was among the most devastating storms to impact Connecticut’s overhead electric distribution network, resulting in over 15 000 outage locations that affected more than 500 000 customers. In this paper, the severity of tree-caused outages in Connecticut is estimated under future-climate Hurricane Sandy simulations, each exhibiting strengthened winds and heavier rain accumulation over the study area from large-scale thermodynamic changes in the atmosphere and track changes in the year ~2100 (referred to as Future Sandy). Three machine-learning models used five weather simulations and the ensemble mean of Current and Future Sandy, along with land-use and overhead utility infrastructure data, to predict the severity and spatial distribution of outages across the Eversource Energy service territory in Connecticut. To assess the influence of increased precipitation in Future Sandy, two approaches were compared: an outage model fit with a full set of variables accounting for both wind and precipitation, and one fit with a reduced set containing only wind. Future Sandy displayed an outage increase of 42%–64% when using the ensemble of WRF simulations fit with three different outage prediction models. This study is a proof of concept for the assessment of increased outage risk resulting from potential changes in tropical cyclone intensity associated with late-century thermodynamic changes driven by the IPCC AR4 A2 emissions scenario.
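As a rough illustration of the full versus wind-only comparison, the sketch below fits two models on Current Sandy predictors and compares predicted outage totals under Future Sandy forcing; the feature names and the single random forest regressor are assumptions, not the three outage models used in the paper:

```python
from sklearn.ensemble import RandomForestRegressor

WIND_VARS = ["max_gust", "mean_wind", "wind_duration"]   # hypothetical grid-cell features
PRECIP_VARS = ["total_precip", "max_precip_rate"]

def predicted_outage_totals(train_df, future_df, target="outages", seed=0):
    """Train full and wind-only outage models on Current Sandy grid cells (DataFrames),
    then compare predicted outage totals under Future Sandy forcing."""
    totals = {}
    for name, cols in {"full": WIND_VARS + PRECIP_VARS, "wind_only": WIND_VARS}.items():
        model = RandomForestRegressor(n_estimators=500, random_state=seed)
        model.fit(train_df[cols], train_df[target])
        totals[name] = float(model.predict(future_df[cols]).sum())
    return totals  # e.g. {'full': ..., 'wind_only': ...}
```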

