The importance of considering speech perception and language acquisition as a multimodal, that is to say audio-visual, phenomenon can hardly be ignored in light of recent evidence. Research from this perspective has demonstrated that young infants are sensitive to the match between auditory (i.e. syllables, vowels and utterances) and visual (i.e. mouth movements) information in native and non-native speech, even when the two are presented sequentially. Over time, as they gain more experience, infants’ perception and processing of native language attributes improve, while this sensitivity seems to decline for non-native attributes (perceptual narrowing). Empirical findings in the field of perceptual narrowing are ambiguous with regard to the onset and the extent of this tuning phenomenon, but there is evidence that factors such as the richness and presentation of the stimuli play a crucial role. Recently, there has been renewed interest in the topic of face-scanning behavior, mainly because eye-tracking devices have made more objective and precise analyses of infants’ gaze patterns possible. Face-scanning behavior is directly associated with audio-visual speech processing, and both have an impact on infants’ future expressive language development. However, no previous study has examined the distance between the native and non-native language in the context of audio-visual speech processing. Previous studies have exclusively considered more distant languages belonging to different rhythm classes, not closer languages belonging to the same rhythm class. Languages that largely do not differ in global rhythmic-prosodic cues but differ, for instance, in more specific phonological and phonetic attributes might impact audio-visual matching and face-scanning behavior in early infancy.
This influence might provide insights into how fine-grained these perception and processing mechanisms are during infancy, when they narrow in the direction of the infant’s native language, and which facial areas infants draw on at different time points during infancy to obtain enough (redundant) cues to acquire their native language(s). Furthermore, no previous studies have combined a longitudinal perspective on infants with a cross-linguistic view of languages in order to reduce inter-individual differences across age groups and generalize the emergence of perceptual narrowing as a cross-linguistic phenomenon. Hence, the present synopsis comprises three studies that address these perspectives on infants’ early audio-visual speech perception of languages belonging to the same rhythm class by investigating early audio-visual matching sensitivities (Study 1), the occurrence of perceptual narrowing (Study 2), and face-scanning behavior during the first year of life and its impact on the infants’ future expressive vocabulary (Study 3). The synopsis summarizes the current state of the (empirical) literature on speech perception, language discrimination and face-scanning behavior before identifying important research gaps, pointing out relevant research questions, presenting the design(s) and the main results of the three empirical studies, and finally discussing the findings and their possible implications for future research and practice. The studies are based on self-collected data from the Bamberg Baby Institute at the University of Bamberg (Germany) and the Uppsala Child and Baby Lab at Uppsala University (Sweden). Whereas the first and second studies were based on a cross-linguistic dataset of German and Swedish infants, the third study’s dataset consisted only of German infants, who were additionally followed longitudinally.