Exploring Big Data Analytic Approaches to Cancer Blog Text Analysis (Preprint)
BACKGROUND In recent years, researchers have begun to recognize the value of social media as a source of data for understanding health-related phenomena. Health blogs in particular are rich in information for decision-making. Although web crawlers and blog analysis software can generate statistics about blogs, these tools are relatively primitive and offer little computational support for analyzing and understanding the social networks and medical blogs that are evolving around health care. More sophisticated tools are needed to fill this gap. Furthermore, to our knowledge, few big data studies or applications address the text analytics of cancer blogs. This study attempts to fill this specific gap.

OBJECTIVE In this exploratory research, we examine the potential of applying big data analytic techniques to blogs in the cancer domain. Our objective is twofold: to extract patterns and insight about cancer diagnosis, treatment, and management from the blogs; and to apply advanced computational techniques to the processing of large amounts of unstructured health data.

METHODS We applied the big data analytics architecture of Hadoop MapReduce, via the Cloudera platform, to the analysis of cancer blog content in order to extract patterns and insight on cancer diagnoses. We applied a series of algorithms to gain insight into the content and developed a vocabulary and taxonomy of keywords based on existing medical nomenclature. The study identifies, for instance, the most discussed topics as well as associations that relate to key phenomena.

RESULTS Using several text analytic algorithms, including word count, word association, clustering, and classification, we were able to identify and analyze the patterns and keywords in cancer blog postings. This gave insight into some of the key issues discussed in the blogs, such as the type of cancer (breast cancer being the dominant topic), diagnosis, and treatments, among others.

CONCLUSIONS In general, big data analytics has the potential to transform the way practitioners and researchers gain insight from health social media, especially content in free-text, unstructured form. Big data analytics and its applications in health-related social media are still at an early stage, and rapid acceleration is possible with advances in models, tools, and technologies.
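
As an illustration of the word count and word association steps named above, the following is a minimal sketch of the map and reduce phases in plain Python. It is not the study's code and does not run on Hadoop or Cloudera. The function names (`map_words`, `map_pairs`, `reduce_counts`, `run_job`) and the three sample posts are hypothetical, invented only for this example. Word association is approximated here as co-occurrence of two words within the same post, which is an assumption and not necessarily the measure the study used.

```python
import re
from collections import Counter
from itertools import combinations

TOKEN = re.compile(r"[a-z]+")  # simple lowercase word tokenizer (assumption)

def map_words(post):
    """Map phase for word count: emit (word, 1) for every token in a post."""
    return [(word, 1) for word in TOKEN.findall(post.lower())]

def map_pairs(post):
    """Map phase for association: emit ((w1, w2), 1) for each unordered
    pair of distinct words that co-occur in the same post."""
    words = sorted(set(TOKEN.findall(post.lower())))
    return [(pair, 1) for pair in combinations(words, 2)]

def reduce_counts(pairs):
    """Reduce phase: sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def run_job(posts, mapper):
    """Apply the mapper to every post, then reduce all emitted pairs."""
    emitted = [kv for post in posts for kv in mapper(post)]
    return reduce_counts(emitted)

# Hypothetical sample posts, made up for illustration only.
posts = [
    "Diagnosis of breast cancer came after a routine screening",
    "Started treatment for breast cancer this week",
    "Treatment side effects and diagnosis questions",
]

word_counts = run_job(posts, map_words)
pair_counts = run_job(posts, map_pairs)

print(word_counts.most_common(5))
print(pair_counts[("breast", "cancer")])
```

In an actual MapReduce job the framework would distribute the map calls across nodes and group the emitted keys before the reduce step; here both phases run in a single process to keep the example self-contained.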