Medical scientists employ ‘quality assessment tools’ to assess evidence from medical research, especially from randomized trials. These tools are designed to take into account methodological details of studies, including randomization, concealment of subject allocation, and other features deemed relevant to minimizing bias. Dozens of such tools are available. They differ widely from one another, and empirical studies show that they have low inter-rater reliability and low inter-tool reliability. This is an instance of a more general problem, here called the underdetermination of evidential significance. Disagreements about the quality of evidence can be due to different, but in principle equally good, weightings of the methodological features that constitute quality assessment tools. Thus, the malleability of empirical research in medicine is deep: in addition to the malleability of first-order empirical methods, such as randomized trials, there is malleability in the tools used to evaluate first-order methods.