Recently, deep metric learning (DML) has received widespread attention in the field of remote sensing image retrieval (RSIR), owing to its ability to extract discriminative features to represent images and then to measure the similarity between images via learning a distance function among feature vectors. However, the distinguishability of features extracted by the most current DML-based methods for RSIR is still not sufficient, and the retrieval efficiency needs to be further improved. To this end, we propose a novel ensemble architecture of residual attention-based deep metric learning (EARA) for RSIR. In our proposed architecture, residual attention is introduced and ameliorated to increase feature discriminability, maintain global features, and concatenate feature vectors of different weights. Then, descriptor ensemble rather than embedding ensemble is chosen to further boost the performance of RSIR with reduced time cost and memory consumption. Furthermore, our proposed architecture can be flexibly extended with different types of deep neural networks, loss functions, and feature descriptors. To evaluate the performance and efficiency of our architecture, we conduct exhaustive experiments on three benchmark remote sensing datasets, including UCMD, SIRI-WHU, and AID. The experimental results demonstrate that the proposed architecture outperforms the four state-of-the-art methods, including BIER, A-BIER, DCES, and ABE, by 15.45%, 13.04%, 10.31%, and 6.62% in the mean Average Precision (mAP), respectively. As for the retrieval execution complexity, the retrieval time and floating point of operations (FLOPs), needed by the proposed architecture on AID, reduce by 92% and 80% compared to those needed by ABE, albeit with the same Recall@1 between the two methods.