Information extraction from digital social trace data with applications to social media and scholarly communication data

Information extraction (IE) aims at extracting structured data from unstructured or semi-structured data. The thesis starts by identifying social media data and scholarly communication data as a special case of digital social trace data (DSTD). This identification allows us to utilize the graph structure of the data (e.g., user connected to a tweet, author connected to a paper, author connected to authors, etc.) for developing new information extraction tasks. The thesis focuses on information extraction from DSTD, first, using only the text data from tweets and scholarly paper abstracts, and then using the full graph structure of Twitter and scholarly communications datasets. This thesis makes three major contributions. First, new IE tasks based on DSTD representation of the data are introduced. For scholarly communication data, methods are developed to identify article and author level novelty [Mishra and Torvik, 2016] and expertise. Furthermore, interfaces for examining the extracted information are introduced. A social communication temporal graph (SCTG) is introduced for comparing different communication data like tweets tagged with sentiment, tweets about a search query, and Facebook group posts. For social media, new text classification categories are introduced, with the aim of identifying enthusiastic and supportive users, via their tweets. Additionally, the correlation between sentiment classes and Twitter meta-data in public corpora is analyzed, leading to the development of a better model for sentiment classification [Mishra and Diesner, 2018]. Second, methods are introduced for extracting information from social media and scholarly data. For scholarly data, a semi-automatic method is introduced for the construction of a large-scale taxonomy of computer science concepts. The method relies on the Wikipedia category tree. 
The constructed taxonomy is used for identifying key computer science phrases in scholarly papers, and tracking their evolution over time. Similarly, for social media data, machine learning models based on human-in-the-loop learning [Mishra et al., 2015], semi-supervised learning [Mishra and Diesner, 2016], and multi-task learning [Mishra, 2019] are introduced for identifying sentiment, named entities, part of speech tags, phrase chunks, and super-sense tags. The machine learning models are developed with a focus on leveraging all available data. The multi-task models presented here result in competitive performance against other methods, for most of the tasks, while reducing inference time computational costs. Finally, this thesis has resulted in the creation of multiple open source tools and public data sets (see URL below), which can be utilized by the research community. The thesis aims to act as a bridge between research questions and techniques used in DSTD from different domains. The methods and tools presented here can help advance work in the areas of social media and scholarly data analysis.
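The graph view described in this abstract (user connected to a tweet, author connected to a paper, author connected to authors) can be pictured with a minimal sketch. The class `TraceGraph`, the relation names `wrote` and `posted`, and the node identifiers are hypothetical choices made for illustration and are not taken from the thesis.

```python
from collections import defaultdict


class TraceGraph:
    """Tiny undirected, typed graph; nodes are (kind, id) tuples."""

    def __init__(self):
        # node -> set of (relation, neighbor) pairs
        self._adj = defaultdict(set)

    def add_edge(self, src, relation, dst):
        # Store the edge in both directions so it can be walked either way.
        self._adj[src].add((relation, dst))
        self._adj[dst].add((relation, src))

    def neighbors(self, node, relation=None):
        # Optionally filter neighbors by relation name.
        return sorted(
            dst for rel, dst in self._adj[node]
            if relation is None or rel == relation
        )


def coauthors(graph, author):
    """Authors connected to `author` through a shared paper."""
    found = set()
    for paper in graph.neighbors(author, "wrote"):
        for other in graph.neighbors(paper, "wrote"):
            if other != author:
                found.add(other)
    return sorted(found)


# Illustrative data only.
graph = TraceGraph()
graph.add_edge(("author", "a1"), "wrote", ("paper", "p1"))
graph.add_edge(("author", "a2"), "wrote", ("paper", "p1"))
graph.add_edge(("user", "u1"), "posted", ("tweet", "t1"))
```

Here the author-to-author link is derived from the author-to-paper edges rather than stored directly; that is one possible design, chosen only to keep the sketch short.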

Large Scale System for Social Media Data Warehousing

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.290890 ◽

2022 ◽

Vol 18 (1) ◽

pp. 0-0

Keyword(s):

Social Media ◽

Information Extraction ◽

Data Warehouse ◽

Large Scale ◽

Data Warehousing ◽

Very High Frequency ◽

Social Media Data ◽

Large Scale System ◽

Linguistic Rules ◽

Media Data

Social media data have become an integral part of business data and should be integrated into the decision-making process, since they better reflect the true state of a business in any field. However, social media data are unstructured and generated at a very high frequency, which exceeds the capacity of the data warehouse. In this work, we propose to extend the data warehousing process with a staging area whose core is a large-scale system implementing an information extraction process using the Storm and Hadoop frameworks to better manage the volume and frequency of these data. For the extraction of structured information, mainly events, we combine a set of techniques from NLP, linguistic rules, and machine learning. Finally, we propose a data warehouse conceptual model for event modeling and integration with the enterprise data warehouse using an intermediate table called a Bridge table. For application and experiments, we focus on extracting drug abuse events from Twitter data and modeling them in the Event Data Warehouse.
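The bridge-table idea mentioned in this abstract can be sketched as a many-to-many link between extracted events and an enterprise dimension. The table names (`event`, `product`, `event_product_bridge`), the columns, and the sample rows below are assumptions made for illustration; the paper's actual conceptual model is not reproduced here.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (
    event_id   INTEGER PRIMARY KEY,
    event_type TEXT,
    tweet_date TEXT
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT
);
-- Bridge table: one row per (event, product) pair.
CREATE TABLE event_product_bridge (
    event_id   INTEGER REFERENCES event(event_id),
    product_id INTEGER REFERENCES product(product_id),
    PRIMARY KEY (event_id, product_id)
);
""")

# Hypothetical rows standing in for extracted events and enterprise data.
conn.executemany(
    "INSERT INTO event VALUES (?, ?, ?)",
    [(1, "drug_abuse", "2020-01-05"), (2, "drug_abuse", "2020-01-06")],
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [(10, "drug_a"), (11, "drug_b")],
)
conn.executemany(
    "INSERT INTO event_product_bridge VALUES (?, ?)",
    [(1, 10), (2, 10), (2, 11)],
)


def events_per_product(connection):
    """Count extracted events linked to each enterprise product."""
    return connection.execute("""
        SELECT p.name, COUNT(*)
        FROM event_product_bridge AS b
        JOIN product AS p ON p.product_id = b.product_id
        GROUP BY p.name
        ORDER BY p.name
    """).fetchall()
```

Because an event may mention several products and a product may appear in many events, the link lives in its own table instead of as a foreign key on either side.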

What Counts? Reflections on the Multivalence of Social Media Data

Digital Culture & Society ◽

10.14361/dcs-2016-0203 ◽

2016 ◽

Vol 2 (2) ◽

pp. 19-38 ◽

Cited By ~ 14

Author(s):

Carolin Gerlitz

Keyword(s):

Social Media ◽

Structured Data ◽

Social Media Data ◽

Orders Of Worth ◽

Social Media Platforms ◽

Set Up ◽

Empirical Experiment ◽

Media Data

Social media platforms have been characterised by their programmability, affordances, constraints and stakeholders - the question of value and valuation of platforms, their data and features has, however, received less attention in platform studies. This paper explores the specific socio-technical conditions for valuating platform data and suggests that platforms set up their data to become multivalent, that is, to be valuable alongside multiple, possibly conflicting value regimes. Drawing on both platform and valuation studies, it asks how the production, storing and circulation of data, its connection to user action and the various stakeholders of platforms contribute to its valuation. Platform data, the paper suggests, is the outcome of capture systems which make it possible to collapse action and its capture into pre-structured data forms which remain open to divergent interpretations. Platforms offer such grammars of action both to users and other stakeholders in front- and back-ends, inviting them to produce and engage with its data following heterogeneous orders of worth. Platform data can participate in different valuation regimes at the same time - however, the paper concludes, not all actors can participate in all modes of valuation, as in the end, it is the platform that sets the conditions for participation. The paper offers a conceptual perspective to interrogate what data counts by attending to questions of quantification, its entanglement with valuation and the various technologies and stakeholders involved. It finishes with an empirical experiment to map the various ways in which Instagram data is made to count.

Monitoring global trends in Covid-19 vaccination intention and confidence: a social media-based deep learning study

10.1101/2021.04.17.21255642 ◽

2021 ◽

Author(s):

Xinyu Zhou ◽

Alex de Figueiredo ◽

Qin Xu ◽

Leesa Lin ◽

Per E Kummervold ◽

...

Keyword(s):

Social Media ◽

Deep Learning ◽

Real Time ◽

Eastern Mediterranean ◽

Learning Models ◽

Social Media Data ◽

Emerging Trends ◽

The US ◽

Media Monitoring ◽

Media Data

Abstract

Background: This study developed deep learning models to monitor global intention and confidence of Covid-19 vaccination in real time.

Methods: We collected 6.73 million English tweets regarding Covid-19 vaccination globally from January 2020 to February 2021. Fine-tuned Transformer-based deep learning models were used to classify tweets in real time as they relate to Covid-19 vaccination intention and confidence. Temporal and spatial trend analyses were performed to map the global prevalence of Covid-19 vaccination intention and confidence, and public engagement on social media was analyzed.

Findings: Globally, the proportion of tweets indicating intent to accept Covid-19 vaccination declined from 64.49% in March to 39.54% in September 2020, and then began to recover, reaching 52.56% in early 2021. This recovery in vaccine acceptance was largely driven by the US and European region, whereas other regions experienced declining trends in 2020. Intent to accept and confidence of Covid-19 vaccination were relatively high in South-East Asia, Eastern Mediterranean, and Western Pacific regions, but low in American, European, and African regions. 12.71% of tweets expressed misinformation or rumors in South Korea, 14.04% expressed distrust in government in the US, and 16.16% expressed that Covid-19 vaccines are unsafe in Greece, each ranking first globally. Negative tweets, especially those containing misinformation or rumors, were engaged with more by Twitter users with fewer followers than positive tweets were.

Interpretation: This global real-time surveillance study highlights the importance of deep learning based social media monitoring to detect emerging trends of Covid-19 vaccination intention and confidence to inform timely interventions.

Funding: National Natural Science Foundation of China.

Research in context

Evidence before this study: With the COVID-19 vaccine rollout, each country should investigate vaccination intention in its local context to ensure mass vaccination. We searched PubMed for all articles/preprints until April 9, 2021 with the keywords “(“Covid-19 vaccines”[Mesh] OR Covid-19 vaccin*[TI]) AND (confidence[TI] OR hesitancy[TI] OR acceptance[TI] OR intention[TI])”. We identified more than 100 studies, most of which are country-level cross-sectional surveys, and the largest global survey of Covid-19 vaccine acceptance to date covered only 32 countries. However, how Covid-19 vaccination intention changes over time remains unknown, and many countries are not yet covered by previous surveys. A few studies assessed public sentiments towards Covid-19 vaccination using social media data, but targeted only limited geographical areas. There is a lack of real-time surveillance, and no study to date has globally monitored Covid-19 vaccination intention in real time.

Added value of this study: To our knowledge, this is the largest global monitoring study of Covid-19 vaccination intention and confidence with social media data in over 100 countries from the beginning of the pandemic to February 2021. This study developed deep learning models by fine-tuning a Bidirectional Encoder Representation from Transformer (BERT)-based model with 8000 manually-classified tweets, which can be used to monitor Covid-19 vaccination beliefs using social media data in real time. It provides temporal and spatial analyses of evolving beliefs about Covid-19 vaccines across the world, as well as insight into many countries not yet covered in previous surveys. This study highlights that the intention to accept Covid-19 vaccination has experienced a declining trend since the beginning of the pandemic in all world regions, with some regions recovering recently, though not to their original levels. This recovery was largely driven by the US and European region (EUR), whereas other regions experienced declining trends in 2020. Intention to accept and confidence of Covid-19 vaccination were relatively high in South-East Asia region (SEAR), Eastern Mediterranean region (EMR), and Western Pacific region (WPR), but low in American region (AMR), EUR, and African region (AFR). Many AFR countries worried more about vaccine effectiveness, while EUR, AMR, and WPR were more concerned about vaccine safety (the highest level of concern being 16.16% in Greece). Online misinformation or rumors were widespread in AMR, EUR, and South Korea (12.71%, ranking first globally), and distrust in government was more prevalent in AMR (14.04% in the US, ranking first globally). Our findings can be used as a reference point for future survey data on a single country, and can inform timely and specific interventions for each country to address Covid-19 vaccine hesitancy.

Implications of all the available evidence: This global real-time surveillance study highlights the importance of deep learning based social media monitoring as a quick and effective method for detecting emerging trends of Covid-19 vaccination intention and confidence to inform timely interventions, especially in settings with limited resources and urgent timelines. Future research should build multilingual deep learning models and monitor Covid-19 vaccination intention and confidence in real time with data from multiple social media platforms.
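The temporal trend reported in this abstract is a per-period share of tweets carrying a given label. A minimal sketch of that aggregation follows; the function name `monthly_share`, the label strings, and the toy data are hypothetical, and the classifier that would produce the labels (a fine-tuned BERT-based model in the study) is not shown.

```python
from collections import Counter


def monthly_share(labeled_tweets, target_label):
    """Share of tweets per month whose label equals `target_label`.

    `labeled_tweets` is an iterable of (month, label) pairs, where the
    label is assumed to come from an upstream classifier.
    """
    totals = Counter()
    hits = Counter()
    for month, label in labeled_tweets:
        totals[month] += 1
        if label == target_label:
            hits[month] += 1
    # Months are returned in sorted order; a month with no hits maps to 0.0.
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Toy input, not data from the study.
sample = [
    ("2020-03", "accept"),
    ("2020-03", "accept"),
    ("2020-03", "reject"),
    ("2020-09", "accept"),
    ("2020-09", "reject"),
]
shares = monthly_share(sample, "accept")
```

Grouping by country or region instead of month would follow the same pattern with a different key.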

Flood Monitoring with Information Extraction Approach from Social Media Data

2020 IEEE Asia-Pacific Conference on Geoscience, Electronics and Remote Sensing Technology (AGERS) ◽

10.1109/agers51788.2020.9452770 ◽

2020 ◽

Author(s):

Prabu Kresna Putra ◽

Dionysius Bryan Sencaki ◽

Galih Prasetya Dinanta ◽

Fauziah Alhasanah ◽

Rachmat Ramadhan

Keyword(s):

Social Media ◽

Information Extraction ◽

Social Media Data ◽

Flood Monitoring ◽

Media Data

Post, Mine, and Be Disturbed: Social Media Data Mining

PsycCRITIQUES ◽

10.1037/a0040619 ◽

2016 ◽

Vol 61 (51) ◽

Author(s):

Daniel Keyes

Keyword(s):

Data Mining ◽

Social Media ◽

Social Media Data ◽

Media Data

Understanding the Interrelationships between Infrastructure Resilience and Social Equity Using Social Media Data

Construction Research Congress 2020 ◽

10.1061/9780784482858.065 ◽

2020 ◽

Author(s):

Sunil Dhakal ◽

Lu Zhang

Keyword(s):

Social Media ◽

Social Equity ◽

Social Media Data ◽

Infrastructure Resilience ◽

Media Data

Psychological Stress Detection from Social Media Data using a Novel Hybrid Model

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i8.853862 ◽

2018 ◽

Vol 6 (8) ◽

pp. 853-862

Author(s):

Shaikha Hajera ◽

Mohammed Mahmood Ali

Keyword(s):

Social Media ◽

Psychological Stress ◽

Hybrid Model ◽

Stress Detection ◽

Social Media Data ◽

Media Data

Utilizing Blockchain Technology in Social Media Bot Identification

10.36227/techrxiv.12049374 ◽

2020 ◽

Author(s):

Shreya Reddy ◽

Lisa Ewen ◽

Pankti Patel ◽

Prerak Patel ◽

Ankit Kundal ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Gold Standard ◽

The Internet ◽

Learning Models ◽

Current Time ◽

Machine Learning Methods ◽

Blockchain Technology ◽

Modern Age ◽

Machine Learning Models

As bots become more prevalent and smarter in the modern age of the internet, it becomes ever more important that they be identified and removed. Recent research has indicated that machine learning methods are accurate and are the gold standard for bot identification on social media. Unfortunately, machine learning models come with drawbacks such as lengthy training times, difficult feature selection, and overwhelming pre-processing tasks. To overcome these difficulties, we propose a blockchain framework for bot identification. At present, it is unknown how this method will perform, but it serves to demonstrate the existence of a substantial research gap in this area.
