Large Scale System for Social Media Data Warehousing

2022 ◽  
Vol 18 (1) ◽  
pp. 0-0

Social media data have become an integral part of business data and should be integrated into the decision process, so that decisions rest on information that better reflects the true situation of a business in any field. However, social media data are unstructured and generated at a very high frequency, which exceeds the capacity of the data warehouse. In this work, we propose to extend the data warehousing process with a staging area whose core is a large-scale system implementing an information extraction process using the Storm and Hadoop frameworks to better manage the volume and frequency of these data. For structured information extraction, mainly of events, we combine a set of techniques from NLP, linguistic rules, and machine learning to accomplish the task. Finally, we propose a suitable data warehouse conceptual model for event modeling and integration with the enterprise data warehouse using an intermediate table called the Bridge table. For application and experiments, we focus on extracting drug abuse events from Twitter data and modeling them in the Event Data Warehouse.

2020 ◽  
Vol 54 (1) ◽  
pp. 1-2
Author(s):  
Shubhanshu Mishra

Information extraction (IE) aims at extracting structured data from unstructured or semi-structured data. The thesis starts by identifying social media data and scholarly communication data as a special case of digital social trace data (DSTD). This identification allows us to utilize the graph structure of the data (e.g., user connected to a tweet, author connected to a paper, author connected to authors, etc.) for developing new information extraction tasks. The thesis focuses on information extraction from DSTD, first, using only the text data from tweets and scholarly paper abstracts, and then using the full graph structure of Twitter and scholarly communications datasets. This thesis makes three major contributions. First, new IE tasks based on DSTD representation of the data are introduced. For scholarly communication data, methods are developed to identify article and author level novelty [Mishra and Torvik, 2016] and expertise. Furthermore, interfaces for examining the extracted information are introduced. A social communication temporal graph (SCTG) is introduced for comparing different communication data like tweets tagged with sentiment, tweets about a search query, and Facebook group posts. For social media, new text classification categories are introduced, with the aim of identifying enthusiastic and supportive users, via their tweets. Additionally, the correlation between sentiment classes and Twitter meta-data in public corpora is analyzed, leading to the development of a better model for sentiment classification [Mishra and Diesner, 2018]. Second, methods are introduced for extracting information from social media and scholarly data. For scholarly data, a semi-automatic method is introduced for the construction of a large-scale taxonomy of computer science concepts. The method relies on the Wikipedia category tree. 
The constructed taxonomy is used for identifying key computer science phrases in scholarly papers, and tracking their evolution over time. Similarly, for social media data, machine learning models based on human-in-the-loop learning [Mishra et al., 2015], semi-supervised learning [Mishra and Diesner, 2016], and multi-task learning [Mishra, 2019] are introduced for identifying sentiment, named entities, part of speech tags, phrase chunks, and super-sense tags. The machine learning models are developed with a focus on leveraging all available data. The multi-task models presented here result in competitive performance against other methods, for most of the tasks, while reducing inference time computational costs. Finally, this thesis has resulted in the creation of multiple open source tools and public data sets (see URL below), which can be utilized by the research community. The thesis aims to act as a bridge between research questions and techniques used in DSTD from different domains. The methods and tools presented here can help advance work in the areas of social media and scholarly data analysis.


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 114851-114861 ◽  
Author(s):  
Zhiguang Zhou ◽  
Xinlong Zhang ◽  
Xiaoyun Zhou ◽  
Yuhua Liu

Author(s):  
Michael Yulianto ◽  
Abba Suganda Girsang ◽  
Reinert Yosua Rumagit

Electronic ticket (e-ticket) provider services are growing fast in Indonesia, making the competition between companies increasingly intense. Moreover, most of them offer the same services or features to their customers. To gather feedback from their customers, many companies use social media (Facebook and Twitter) for marketing activity or for communicating directly with their customers. Current technology allows a company to collect data from social media, so many companies collect social media data for analysis. This study proposes developing a data warehouse to analyze social media data such as likes, comments, and sentiment. Since sentiment is not provided directly by social media data, this study uses lexicon-based classification to categorize the sentiment of users’ comments. This data warehouse provides business intelligence for viewing the performance of a company based on its social media data. The data warehouse is built using data from three travel companies in Indonesia. As a result, it provides a comparison of their performance based on social media data.
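
As a rough illustration of what lexicon-based sentiment classification can look like, the following minimal sketch scores a comment by counting words from a positive and a negative word list. The word lists, the function name `classify_sentiment`, and the three-way labels are assumptions made for this example only; the abstract does not state which lexicon or scoring rule the study uses.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration);
# a real system would load a curated sentiment word list.
POSITIVE = {"good", "great", "fast", "helpful", "easy"}
NEGATIVE = {"bad", "slow", "late", "rude", "difficult"}


def classify_sentiment(comment: str) -> str:
    """Label a comment by counting lexicon hits (+1 positive, -1 negative)."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", comment.lower())
    # Each positive word adds one to the score, each negative word subtracts one.
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

In a warehouse setting, a label produced this way could be stored alongside measures such as likes and comment counts, so that sentiment can be aggregated per company. A plain word count like this ignores negation and sarcasm, which is one reason it should be read as a sketch and not as the study's classifier.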


2019 ◽  
Vol 38 (5) ◽  
pp. 633-650 ◽  
Author(s):  
Josh Pasek ◽  
Colleen A. McClain ◽  
Frank Newport ◽  
Stephanie Marken

Researchers hoping to make inferences about social phenomena using social media data need to answer two critical questions: What is it that a given social media metric tells us? And who does it tell us about? Drawing from prior work on these questions, we examine whether Twitter sentiment about Barack Obama tells us about Americans’ attitudes toward the president, the attitudes of particular subsets of individuals, or something else entirely. Specifically, using large-scale survey data, this study assesses how patterns of approval among population subgroups compare to tweets about the president. The findings paint a complex picture of the utility of digital traces. Although attention to subgroups improves the extent to which survey and Twitter data can yield similar conclusions, the results also indicate that sentiment surrounding tweets about the president is no proxy for presidential approval. Instead, after adjusting for demographics, these two metrics tell similar macroscale, long-term stories about presidential approval but very different stories at a more granular level and over shorter time periods.


Author(s):  
Suppawong Tuarob ◽  
Conrad S. Tucker

The authors of this work propose a Knowledge Discovery in Databases (KDD) model for predicting product market adoption and longevity using large-scale social media data. Social media data, available through sites such as Twitter® and Facebook®, have been shown to be leading indicators and predictors of events ranging from influenza spread to financial stock market prices and movie revenues. The ubiquitous and colloquial nature of social media allows users to express their opinions honestly in a unified, dynamic manner. This makes social media a relatively new data gathering source that can potentially appeal to designers and enterprise decision makers aiming to understand consumers’ response to their upcoming or newly launched products. Existing design methodologies for leveraging large-scale data have traditionally relied on product reviews available on the internet to mine product information. However, such web reviews often come from disparate sources, making the aggregation and knowledge discovery process quite cumbersome, especially for reviews of poorly received products. Furthermore, such web reviews have not been shown to be strong indicators of new product market adoption. In this paper, the authors demonstrate how social media can be used to predict and mine information relating to product features, product competition, and market adoption. In particular, the authors analyze the sentiment in tweets and use the results to predict product sales. The authors present a mathematical model that quantifies the correlations between social media sentiment and product market adoption in an effort to compute the ability of individual products to stay in the market. The proposed technique involves computing the Subjectivity, Polarity, and Favorability of the product. Finally, the authors utilize Information Retrieval techniques to mine users’ opinions about strong, weak, and controversial features of a given product model. The authors evaluate their approaches using real-world smartphone data obtained from www.statista.com and www.gsmarena.com.


Author(s):  
Xiaomo Liu ◽  
Armineh Nourbakhsh ◽  
Quanzhi Li ◽  
Sameena Shah ◽  
Robert Martin ◽  
...  

2020 ◽  
Vol 376 ◽  
pp. 244-255 ◽  
Author(s):  
Zhiguang Zhou ◽  
Xinlong Zhang ◽  
Zhiyong Guo ◽  
Yuhua Liu

2015 ◽  
Vol 137 (7) ◽  
Author(s):  
Suppawong Tuarob ◽  
Conrad S. Tucker

Lead users play a vital role in next generation product development, as they help designers discover relevant product feature preferences months or even years before they are desired by the general customer base. Existing design methodologies proposed to extract lead user preferences are typically constrained by temporal, geographic, size, and heterogeneity limitations. To mitigate these challenges, the authors of this work propose a set of mathematical models that mine social media networks for lead users and the product features that they express relating to specific products. The authors hypothesize that: (i) lead users are discoverable from large scale social media networks and (ii) product feature preferences, mined from lead user social media data, represent product features that do not currently exist in product offerings but will be desired in future product launches. An automated approach to lead user product feature identification is proposed to identify latent features (product features unknown to the public) from social media data. These latent features then serve as the key to discovering innovative users from the ever increasing pool of social media users. The authors collect 2.1 × 10^9 social media messages in the United States during a period of 31 months (from March 2011 to September 2013) in order to determine whether lead user preferences are discoverable and relevant to next generation cell phone designs.

