Collecting Large-scale Comparative Text Data on Legislative Debates
Parliamentary speeches are one of the most consistently available sources of information about political priorities, actor positions, and conflict structures in democratic states. Recent advances in automated text analysis offer a growing number of tools to tap into this reservoir of information systematically. However, collecting the high-quality text data needed to realize the comparative potential of these text analysis algorithms is costly and faces various practical hurdles. To address this challenge, this chapter offers three contributions. First, we outline best-practice guidelines and useful tools for researchers wishing to collect new legislative debate corpora or to extend existing ones. Second, we present an extended version of the ParlSpeech Corpus. Third, we highlight the difficulties of comparing text-as-data outputs across parliaments, pointing to differences in language, in traditions and conventions, and in metadata availability.