The World Wide Web as Complex Data Set: Expanding the Digital Humanities into the Twentieth Century and Beyond through Internet Research
While intellectual property protections effectively frame digital humanities text mining as a field primarily for the study of the nineteenth century, the Internet offers an intriguing object of study for humanists working in later periods. As a complex data source, the World Wide Web presents its own methodological challenges for digital humanists, but lessons learned from projects studying large nineteenth century corpora offer helpful starting points. Complicating matters further, legal and ethical questions surrounding web scraping, or the practice of large scale data retrieval over the Internet, will require humanists to frame their research to distinguish it from commercial and malicious activities. This essay reviews relevant research in the digital humanities and new media studies in order to show how web scraping might contribute to humanities research questions. In addition to recommendations for addressing the complex concerns surrounding web scraping this essay also provides a basic overview of the process and some recommendations for resources.