Pitting corpus-based classification models against each other: a case study for predicting constructional choice in written Estonian
Abstract
In the context of constructional alternatives, we may assume that speakers’ choice between alternative forms is influenced by a multitude of factors. At the moment, multivariate statistical classification modelling seems to be the best tool available to capture this knowledge quantitatively, and a vast array of techniques is available. In this paper, two distinct modelling techniques, logistic regression and naïve discriminative learning, are applied to predict the choice between two constructional alternatives in written Estonian. One of the central questions in statistical modelling concerns the evaluation of model fit. It is proposed that, for linguistic analysis, the performance of alternative corpus-based models can be evaluated by, first, pitting them against each other and, second, pitting them against experimental data. Previous work on modelling constructional and lexical choice has focused on only one of these two kinds of comparison. The present paper takes this line of analysis further by combining the two approaches.