PANDORA Talks: Personality and Demographics on Reddit
Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset in three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.