Putting Psychology to the Test: Rethinking Model Evaluation Through Benchmarking and Prediction

Consensus on standards for evaluating models and theories is an integral part of every science. Nonetheless, in psychology, relatively little focus has been placed on defining reliable communal metrics to assess model performance. Evaluation practices are often idiosyncratic and are affected by a number of shortcomings (e.g., failure to assess models’ ability to generalize to unseen data) that make it difficult to discriminate between good and bad models. Drawing inspiration from fields such as machine learning and statistical genetics, we argue in favor of introducing common benchmarks as a means of overcoming the lack of reliable model evaluation criteria currently observed in psychology. We discuss a number of principles benchmarks should satisfy to achieve maximal utility, identify concrete steps the community could take to promote the development of such benchmarks, and address a number of potential pitfalls and concerns that may arise in the course of implementation. We argue that reaching consensus on common evaluation benchmarks will foster cumulative progress in psychology and encourage researchers to place heavier emphasis on the practical utility of scientific models.

Download Full-text

Putting psychology to the test: Rethinking model evaluation through benchmarking and prediction

10.31234/osf.io/e437b ◽

2020 ◽

Cited By ~ 1

Author(s):

Roberta Rocca ◽

Tal Yarkoni

Keyword(s):

Machine Learning ◽

Performance Evaluation ◽

Model Evaluation ◽

Evaluation Criteria ◽

Model Performance ◽

Scientific Models ◽

Practical Utility ◽

Model Performance Evaluation ◽

Unseen Data ◽

Reliable Model

Consensus on standards for evaluating models and theories is an integral part of every science. Nonetheless, in psychology, relatively little focus has been placed on defining reliable communal metrics to assess model performance. Evaluation practices are often idiosyncratic, and are affected by a number of shortcomings (e.g., failure to assess models' ability to generalize to unseen data) that make it difficult to discriminate between good and bad models. Drawing inspiration from fields like machine learning and statistical genetics, we argue in favor of introducing common benchmarks as a means of overcoming the lack of reliable model evaluation criteria currently observed in psychology. We discuss a number of principles benchmarks should satisfy to achieve maximal utility; identify concrete steps the community could take to promote the development of such benchmarks; and address a number of potential pitfalls and concerns that may arise in the course of implementation. We argue that reaching consensus on common evaluation benchmarks will foster cumulative progress in psychology, and encourage researchers to place heavier emphasis on the practical utility of scientific models.

Download Full-text

Comparison of Machine Learning Algorithms in Predicting the COVID-19 Outbreak

10.4018/978-1-7998-8674-7.ch017 ◽

2022 ◽

pp. 320-336

Author(s):

Asiye Bilgili

Keyword(s):

Machine Learning ◽

Performance Evaluation ◽

Health Informatics ◽

Medical Information ◽

Learning Algorithms ◽

Model Performance ◽

Machine Learning Algorithms ◽

Training Dataset ◽

Implementation Phase ◽

Model Performance Evaluation

Health informatics is an interdisciplinary field in the computer and health sciences. Health informatics, which enables the effective use of medical information, has the potential to reduce both the cost and the burden of healthcare workers during the pandemic process. Using the machine learning algorithms support vector machines, naive bayes, k-nearest neighbor, and C4.5 algorithms, a model performance evaluation was performed to identify the algorithm that will show the highest performance for the prediction of the disease. Three separate training and test datasets were created 70% - 30%, 75% - 25%, and 80% - 20%, respectively. The implementation phase of the study was carried out by following the CRISP-DM steps, and the analyses were made using the R language. By examining the model performance evaluation criteria, the findings show that the C4.5 algorithm showed the best performance with 70% training dataset.

Download Full-text