Models, forests, and trees of York English: Was/were variation as a case study for statistical practice

2012 ◽  
Vol 24 (2) ◽  
pp. 135-178 ◽  
Author(s):  
Sali A. Tagliamonte ◽  
R. Harald Baayen

Abstract: What is the explanation for vigorous variation between was and were in plural existential constructions, and what is the optimal tool for analyzing it? Previous studies of this phenomenon have used the variable rule program, a generalized linear model; however, recent developments in statistics have introduced new tools, including mixed-effects models, random forests, and conditional inference trees that may open additional possibilities for data exploration, analysis, and interpretation. In a step-by-step demonstration, we show how this well-known variable benefits from these complementary techniques. Mixed-effects models provide a principled way of assessing the importance of random-effect factors such as the individuals in the sample. Random forests provide information about the importance of predictors, whether factorial or continuous, and do so also for unbalanced designs with high multicollinearity, cases for which the family of linear models is less appropriate. Conditional inference trees straightforwardly visualize how multiple predictors operate in tandem. Taken together, the results confirm that polarity, distance from verb to plural element, and the nature of the DP are significant predictors. Ongoing linguistic change and social reallocation via morphologization are operational. Furthermore, the results make predictions that can be tested in future research. We conclude that variationist research can be substantially enriched by an expanded tool kit.

2019 ◽  
Vol 11 (4) ◽  
pp. 555-581
Author(s):  
JOAN BYBEE ◽  
RICARDO NAPOLEÃO DE SOUZA

Abstract: Using ten English adjectives, this study tests the hypothesis that the vowels in adjectives in predicative constructions are longer than those in attributive constructions in spoken conversation. The analyses considered a number of factors: occurrence before a pause, lexical adjective, vowel identity, probability given surrounding words, and others. Two sets of statistical techniques were used: a Mixed-effects model and the Random Forest Analysis based on Conditional Inference Trees (CIT). Both analyses showed strong effects of predicative vs. attributive constructions and individual lexical adjectives on vowel duration in the predicted direction, as well as effects of many of the phonological variables tested. The results showed that the longer duration in the predicative construction is not due to lengthening before a pause, though it is related to whether the adjective is internal or final in the predicative construction. Nor is the effect attributable solely to the probability of the occurrence of the adjective; rather construction type has to be taken into account. The two statistical techniques complement each other, with the Mixed-effects model showing very general trends over all the data, and the Random Forest / CIT analysis showing factors that affect only subsets of the data.


2021 ◽  
Author(s):  
Dylan G.E. Gomes

Abstract: As generalized linear mixed-effects models (GLMMs) have become a widespread tool in ecology, the need to guide the use of such tools is increasingly important. One common guideline is that one needs at least five levels of a random effect. Having fewer levels than this makes the estimation of the variance of random effects terms (such as ecological sites, individuals, or populations) difficult, but it need not muddy one’s ability to estimate fixed effects terms – which are often of primary interest in ecology. Here, I simulate ecological datasets and fit simple models and show that having too few random effects terms does not influence the parameter estimates or uncertainty around those estimates for fixed effects terms. Thus, it should be acceptable to use fewer levels of random effects if one is not interested in making inference about the random effects terms (i.e. they are ‘nuisance’ parameters used to group non-independent data). I also use simulations to assess the potential for pseudoreplication in (generalized) linear models (LMs), when random effects are explicitly ignored and find that LMs do not show increased type-I errors compared to their mixed-effects model counterparts. Instead, LM uncertainty (and p values) appears to be more conservative in an analysis with a real ecological dataset presented here. These results challenge the view that it is never appropriate to model random effects terms with fewer than five levels – specifically when inference is not being made for the random effects, but suggest that in simple cases LMs might be robust to ignored random effects terms. Given the widespread accessibility of GLMMs in ecology and evolution, future simulation studies and further assessments of these statistical methods are necessary to understand the consequences of both violating and blindly following simple guidelines.


Forests ◽  
2018 ◽  
Vol 10 (1) ◽  
pp. 20 ◽  
Author(s):  
Philipp Kilham ◽  
Christoph Hartebrodt ◽  
Gerald Kändler

Wood supply predictions from forest inventories involve two steps. First, it is predicted whether harvests occur on a plot in a given time period. Second, for plots on which harvests are predicted to occur, the harvested volume is predicted. This research addresses this second step. For forests with more than one species and/or forests with trees of varying dimensions, overall harvested volume predictions are not satisfactory and more detailed predictions are required. The study focuses on southwest Germany where diverse forest types are found. Predictions are conducted for plots on which harvests occurred in the 2002–2012 period. For each plot, harvest probabilities of sample trees are predicted and used to derive the harvested volume (m³ over bark in 10 years) per hectare. Random forests (RFs) have become popular prediction models as they define the interactions and relationships of variables in an automatized way. However, their suitability for predicting harvest probabilities for inventory sample trees is questionable and has not yet been examined. Generalized linear mixed models (GLMMs) are suitable in this context as they can account for the nested structure of tree-level data sets (trees nested in plots). It is unclear if RFs can cope with this data structure. This research aims to clarify this question by comparing two RFs, one based on conditional inference trees (CTree-RF) and one based on classification and regression trees (CART-RF), with a GLMM. For this purpose, the models were fitted on training data and evaluated on an independent test set. Both RFs achieved better prediction results than the GLMM. Regarding plot-level harvested volumes per ha, they achieved higher variances explained (VEs) and significantly (p < 0.05) lower mean absolute residuals when compared to the GLMM. VEs were 0.38 (CTree-RF), 0.37 (CART-RF), and 0.31 (GLMM). Root mean squared errors were 138.3, 139.9 and 145.5, respectively.
The research demonstrates the suitability and advantages of RFs for predicting harvest decisions on the level of inventory sample trees. RFs can become important components within the generation of business-as-usual wood supply scenarios worldwide as they are able to learn and predict harvest decisions from NFIs in an automatized and self-adapting way. The applied approach is not restricted to specific forests or harvest regimes and delivers detailed species and dimension information for the harvested volumes.
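
The three evaluation measures named in this abstract (variance explained, root mean squared error, and mean absolute residual) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions below are assumptions: variance explained is taken as 1 minus the ratio of residual to total sum of squares, and the function names and plot-level values are made up for illustration, not taken from the study.

```python
import math


def variance_explained(observed, predicted):
    """Assumed definition: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error of the predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_residual(observed, predicted):
    """Mean of the absolute differences between observed and predicted."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical harvested volumes per hectare for four plots (not real data).
observed = [100.0, 250.0, 50.0, 400.0]
predicted = [120.0, 200.0, 80.0, 350.0]

print(variance_explained(observed, predicted))
print(rmse(observed, predicted))
print(mean_absolute_residual(observed, predicted))
```

Computing all three on the same held-out test set, as the abstract describes, is what makes the comparison between the RFs and the GLMM like for like.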


2018 ◽  
Author(s):  
Van Rynald T Liceralde ◽  
Peter C. Gordon

Power transforms have been increasingly used in linear mixed-effects models (LMMs) of chronometric data (e.g., response times [RTs]) as a statistical solution to preempt violating the assumption of residual normality. However, differences in results between LMMs fit to raw RTs and transformed RTs have reignited discussions on issues concerning the transformation of RTs. Here, we analyzed three word-recognition megastudies and performed Monte Carlo simulations to better understand the consequences of transforming RTs in LMMs. Within each megastudy, transforming RTs produced different fixed- and random-effect patterns; across the megastudies, RTs were optimally normalized by different power transforms, and results were more consistent among LMMs fit to raw RTs. Moreover, the simulations showed that LMMs fit to optimally normalized RTs had greater power for main effects in smaller samples, but that LMMs fit to raw RTs had greater power for interaction effects as sample sizes increased, with negligible differences in Type I error rates between the two models. Based on these results, LMMs should be fit to raw RTs when there is no compelling reason beyond nonnormality to transform RTs and when the interpretive framework mapping the predictors and RTs treats RT as an interval scale.
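
As a rough illustration of what a power transform of RTs looks like, the sketch below uses the Box-Cox family. This is an assumption: the abstract does not say which family of power transforms the authors applied, and the function name and RT values are hypothetical.

```python
import math


def power_transform(rts, lam):
    """Box-Cox style transform (assumed family): (rt**lam - 1) / lam, or log(rt) when lam == 0."""
    if any(rt <= 0 for rt in rts):
        raise ValueError("response times must be strictly positive")
    if lam == 0:
        return [math.log(rt) for rt in rts]
    return [(rt ** lam - 1.0) / lam for rt in rts]


# Hypothetical response times in milliseconds (not real data).
rts = [420.0, 515.0, 630.0, 1480.0]

log_rts = power_transform(rts, 0)       # lam = 0 gives the log transform
inverse_rts = power_transform(rts, -1)  # lam = -1 is a shifted, sign-flipped 1/RT
raw_like = power_transform(rts, 1)      # lam = 1 only shifts the raw RTs by 1

print(log_rts)
print(inverse_rts)
print(raw_like)
```

With lam = 1 the transform is linear, so the RTs keep their interval-scale interpretation; any other value rescales the differences between RTs, which is the interpretive concern the abstract raises.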


2010 ◽  
Vol 40 (10) ◽  
pp. 2015-2026 ◽  
Author(s):  
H.G. Pearce ◽  
W.R. Anderson ◽  
L.G. Fogarty ◽  
C.L. Todoroki ◽  
S.A.J. Anderson

Shrubland biomass is important for fire management programmes and for carbon estimates. Aboveground biomass and the combustible portion of biomass, the fuel load, in the past have been measured using destructive techniques. These techniques are detailed, highly labour intensive, and costly; hence, an alternative approach was sought. The new approach used linear mixed-effects models to estimate biomass and fuel loads from easily measured field variables: shrub overstorey height and cover, and understorey height and cover. Site was regarded as a random effect. Sampling sites were located throughout New Zealand and included a range of shrubland vegetation types: manuka ( Leptospermum scoparium J.R. Forst. et G. Forst.) and kanuka ( Kunzea ericoides (A. Rich.) J. Thomps.) scrub and heath, pakihi (mixed low heath, fern, and rushes), and gorse ( Ulex europaeus L.). The approach was extended and confidence intervals were constructed for the regression models. Statistical analysis showed that understorey height and overstorey cover were significant (at the 5% level) in some cases. Overstorey height was highly significant in all cases (p < 0.0001), allowing development of models useful to the operational user. The models allow rapid estimation of average fuel loads or biomass on new sites, and double sampling theory can be applied to calculate the error in the resultant biomass estimate.


2013 ◽  
Vol 32 (5) ◽  
pp. 1187-1195 ◽  
Author(s):  
Sebastian Strempel ◽  
Monika Nendza ◽  
Martin Scheringer ◽  
Konrad Hungerbühler
