Predicting Educational Background using Text Mining
We examine to what extent educational background can be inferred from written text, on the assumption that educational level is associated with writing style and language use. Using a large public dataset of almost 60,000 dating profiles, each containing written text, we look for a methodology to measure author style. We focus on the education and essay fields of each profile, from which we try to identify features of the written text that reveal the author's level of education. We classify the level of education as elementary or higher education using three text representations: (i) lexical features; (ii) Linguistic Inquiry and Word Count (LIWC) features; and (iii) a combination of LIWC and lexical features. For classification, we rely on regularized logistic regression. The joint model, which uses both lexical and LIWC features, predicts the education level better than the other text representation models, but the contribution of LIWC is marginal. Our results may be useful not only in the context of the platform economy and online markets, but also more generally to researchers who need to rely on written text as an indicator of educational background.
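To make the joint model described above concrete, the following is a minimal sketch of one way to combine lexical features with dictionary-based category features and fit a regularized logistic regression. It is not the implementation used in this study. The toy texts and labels, the small category lexicon (a hypothetical stand-in for the LIWC dictionary, which is not reproduced here), the function names, the choice of an L2 penalty and all hyperparameters are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Toy data, invented for illustration only (1 = higher education, 0 = otherwise).
TEXTS = [
    "i completed my thesis and enjoy research seminars",
    "my research on statistics led to a published thesis",
    "i like hanging out and watching tv with friends",
    "love music and hanging out with my friends",
]
LABELS = [1, 1, 0, 0]

# Hypothetical category lexicon standing in for LIWC; NOT the real dictionary.
TOY_CATEGORIES = {
    "first_person": {"i", "my", "me"},
    "social": {"friends", "family", "people"},
}

def tokenize(text):
    return text.lower().split()

def build_vocab(texts):
    """Map every token seen in the training texts to a feature index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def joint_features(text, vocab):
    """Sparse vector {index: value}: relative word frequencies (lexical),
    followed by the share of tokens falling in each toy category."""
    tokens = tokenize(text)
    total = len(tokens) or 1
    counts = Counter(tokens)
    feats = {vocab[t]: c / total for t, c in counts.items() if t in vocab}
    for k, words in enumerate(TOY_CATEGORIES.values()):
        share = sum(c for t, c in counts.items() if t in words) / total
        if share:
            feats[len(vocab) + k] = share
    return feats

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(x, w, b):
    return sigmoid(b + sum(w[j] * v for j, v in x.items()))

def train_logreg(X, y, n_features, l2=0.01, lr=0.5, epochs=300):
    """Batch gradient descent on mean log-loss plus (l2 / 2) * ||w||^2.
    The intercept b is left unpenalized."""
    w, b, n = [0.0] * n_features, 0.0, len(X)
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]
        grad_b = 0.0
        for x, yi in zip(X, y):
            err = predict_proba(x, w, b) - yi
            for j, v in x.items():
                grad_w[j] += err * v / n
            grad_b += err / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

vocab = build_vocab(TEXTS)
X = [joint_features(t, vocab) for t in TEXTS]
w, b = train_logreg(X, LABELS, n_features=len(vocab) + len(TOY_CATEGORIES))
predictions = [int(predict_proba(x, w, b) > 0.5) for x in X]
print(predictions)
```

Dropping the category loop in `joint_features` gives a lexical-only representation, and keeping only that loop gives a category-only one, so the same classifier can be used to compare the three representations. With real data, the comparison would of course be made on held-out profiles rather than on the training texts.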