In this paper, we discuss the possibility that elementary arithmetic, such as addition and other basic operations, can serve as a benchmark for measuring the ability to extrapolate, which is considered important for realizing general-purpose intelligence. Understanding addition can be regarded as the ability to correctly add numbers with arbitrarily many digits by memorizing and applying the rules for adding single digits and by learning the rules of carrying. We propose a benchmark in which we prepare a small training dataset that should be sufficient to reveal the algebra of addition, and measure accuracy on a test dataset that requires operations over much larger numbers of digits. Our benchmark has the following advantages over the datasets usually used in recognition tasks and reinforcement learning: its simple structure makes it easy to generate datasets, to adjust and extend the difficulty, and to identify inductive biases, and it allows discussion from theoretical perspectives in computer science such as program synthesis. We hope that the elementary arithmetic benchmark will reveal functions missing from current AI systems and provide a good starting point for developing them. In particular, we speculate that the use of knowledge may be required for a system to compute correctly over arbitrarily many digits. Finally, based on these insights, we propose a future direction for the development of systems with extrapolation capabilities.
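
As an illustration of the benchmark construction described above, the following Python sketch generates a small training set of short additions and a test set of much longer ones. The digit ranges, dataset sizes, and question format here are hypothetical choices made for illustration, not the exact settings of the proposed benchmark.

```python
import random

def make_addition_examples(num_examples, min_digits, max_digits, seed=0):
    """Generate (question, answer) string pairs such as ('37+58', '95').

    The operand digit range controls difficulty: a small range for the
    training set, a much larger range for testing extrapolation.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(num_examples):
        d_a = rng.randint(min_digits, max_digits)
        d_b = rng.randint(min_digits, max_digits)
        a = rng.randint(10 ** (d_a - 1), 10 ** d_a - 1)
        b = rng.randint(10 ** (d_b - 1), 10 ** d_b - 1)
        examples.append((f"{a}+{b}", str(a + b)))
    return examples

# Hypothetical split: train on 1-3 digit operands, test on 10-20 digit operands.
train_set = make_addition_examples(1000, 1, 3, seed=0)
test_set = make_addition_examples(1000, 10, 20, seed=1)

print(train_set[0])  # e.g. ('7+41', '48')
print(test_set[0])   # e.g. a much longer addition problem
```

A model is then trained only on the short-operand split, and its exact-match accuracy on the long-operand split serves as the extrapolation measure discussed in this paper.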