Stimulus locations are detected differently by different sensory systems, but ultimately they yield similar percepts and behavioral responses. How the brain transcends initial differences to compute similar codes is unclear. We quantitatively compared the reference frames of two sensory modalities, vision and audition, across three interconnected brain areas involved in generating saccades, namely the frontal eye fields (FEF), lateral and medial parietal cortex (M/LIP), and superior colliculus (SC). We recorded from single neurons in head-restrained monkeys performing auditory- and visually-guided saccades from variable initial fixation locations, and evaluated whether their receptive fields were better described as eye-centered, head-centered, or hybrid (i.e. not anchored uniquely to head- or eye-orientation). We found a progression of reference frames across areas and across time, with considerable hybrid-ness and persistent differences between modalities during most epochs/brain regions. For both modalities, the SC was more eye-centered than the FEF, which in turn was more eye-centered than the predominantly hybrid M/LIP. In all three areas and temporal epochs from stimulus onset to movement, visual signals were more eye-centered than auditory signals. In the SC and FEF, auditory signals became more eye-centered at the time of the saccade than they were initially after stimulus onset, but only in the SC at the time of the saccade did the auditory signals become predominantly eye-centered. The results indicate that visual and auditory signals both undergo transformations, ultimately reaching the same final reference frame but via different dynamics across brain regions and time.