A comparison of distributed machine learning methods for the support of "Many Labs" collaborations in computational modelling of decision making (Preprint)
UNSTRUCTURED Deep learning models, especially RNN models, are potentially powerful tools for representing the complex learning processes and decision-making strategies used by humans. Such neural network models make fewer assumptions about the underlying mechanisms thus providing experimental flexibility in terms of applicability. However this comes at the cost of requiring a larger number of tunable parameters requiring significantly more training and representative data for effective learning. This presents practical challenges given that most computational modelling experiments involve relatively small numbers of subjects, which while adequate for conventional modelling using low dimensional parameter spaces, leads to sub-optimal model training when adopting deeper neural network approaches. Laboratory collaboration is a natural way of increasing data availability however, data sharing barriers among laboratories as necessitated by data protection regulations encourage us to seek alternative methods for collaborative data science. Distributed learning, especially federated learning, which supports the preservation of data privacy, is a promising method for addressing this issue. To verify the reliability and feasibility of applying federated learning to train neural networks models used in the characterisation of human decision making, we conducted experiments on a real-world, many-labs data pool including experimentally significant data-sets from ten independent studies. The performance of single models that were trained on single laboratory data-sets was poor, especially those with small numbers of subjects. This unsurprising finding supports the need for larger and more diverse data-sets to train more generalised and reliable models. To that end we evaluated four collaborative approaches for comparison purposes. The first approach represents conventional centralized data sharing (CL-based) and is the optimal approach but requires complete sharing of data which we wish to avoid. The results however establish a benchmark for the other three distributed approaches; federated learning (FL-based), incremental learning (IL-based), and cyclic incremental learning (CIL-based). We evaluate these approaches in terms of prediction accuracy and capacity to characterise human decision-making strategies in the context of the computational modelling experiments considered here. The results demonstrate that the FL-based model achieves performance most comparable to that of a centralized data sharing approach. This demonstrate that federated learning has value in scaling data science methods to data collected in computational modelling contexts in circumstances where data sharing is not convenient, practical or permissible.