Techniques for Sampling Online Text-Based Data Sets

Big Data ◽  
2016 ◽  
pp. 655-675
Author(s):  
Lynne M. Webb ◽  
Yuanxin Wang

The chapter reviews traditional sampling techniques and suggests adaptations relevant to big data studies of text downloaded from online media such as email messages, online gaming, blogs, micro-blogs (e.g., Twitter), and social networking websites (e.g., Facebook). The authors review methods of probability, purposeful, and adaptive sampling of online data. They illustrate the use of these sampling techniques via published studies that report analysis of online text.
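
As a rough illustration of the probability sampling family mentioned in this abstract, the sketch below draws simple random, systematic, and stratified samples from a list of posts. It is not taken from the chapter: the function names, the `posts` corpus, and the platform labels are all hypothetical, and it assumes the full population of posts fits in memory.

```python
import random
from collections import defaultdict


def simple_random_sample(items, n, seed=None):
    """Draw n items without replacement, each with equal probability."""
    rng = random.Random(seed)
    return rng.sample(items, n)


def systematic_sample(items, n, seed=None):
    """Take every k-th item after a random start, with k = len(items) // n."""
    if n <= 0 or n > len(items):
        raise ValueError("n must be between 1 and len(items)")
    rng = random.Random(seed)
    step = len(items) // n
    start = rng.randrange(step)
    return [items[start + i * step] for i in range(n)]


def stratified_sample(items, key, fraction, seed=None):
    """Sample the same fraction from every stratum (at least one item each)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical corpus: 1,000 posts spread over three made-up platform labels.
posts = [
    {"id": i, "platform": ("blog", "microblog", "forum")[i % 3]}
    for i in range(1000)
]

random_50 = simple_random_sample(posts, 50, seed=1)
every_20th = systematic_sample(posts, 50, seed=1)
ten_percent = stratified_sample(posts, lambda p: p["platform"], 0.1, seed=1)
```

Stratifying by platform keeps each source represented in proportion to its size, which a simple random draw only achieves on average.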


2015 ◽  
Vol 30 (2) ◽  
pp. 157-170 ◽  
Author(s):  
Rizwana Irfan ◽  
Christine K. King ◽  
Daniel Grages ◽  
Sam Ewen ◽  
Samee U. Khan ◽  
...  

Abstract: In this survey, we review text mining techniques for discovering textual patterns on social networking sites. Social network applications, such as chat, comments, and discussion boards, create opportunities for interaction among people, leading to mutual learning and the sharing of valuable knowledge. Data on social networking websites is inherently unstructured and fuzzy. In everyday conversation, people pay little attention to spelling and accurate grammatical construction, which can lead to lexical, syntactic, and semantic ambiguities. Analyzing and extracting information patterns from such data sets is therefore more complex. Several surveys have analyzed methods for information extraction. Most of them emphasize the application of text mining techniques to unstructured data sets that reside in text documents, but do not specifically target data sets from social networking websites. This survey attempts to provide a thorough understanding of text mining techniques and of their application to social networking websites. It investigates recent advances in text analysis and covers two basic approaches to text mining, classification and clustering, which are widely used to explore unstructured text available on the Web.
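
To make the classification approach named in this abstract concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written against the standard library only. It is an illustration, not a method from the survey: the `tokenize` normalization rule, the `NaiveBayes` class, and the four training messages are all hypothetical, and a real study would need far more data and a proper evaluation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, squeeze runs of 3+ repeated characters to 2, keep letter runs."""
    text = re.sub(r"(.)\1{2,}", r"\1\1", text.lower())
    return re.findall(r"[a-z']+", text)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-written training messages (not from any real data set).
texts = [
    "loved the concert tonight sooo good",
    "great game with friends",
    "terrible service never again",
    "worst update ever so buggy",
]
labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(texts, labels)
guess = model.predict("sooooo good game")
```

The character-squeezing step in `tokenize` is one simple way to handle the informal spelling the abstract describes: "sooo" and "sooooo" both normalize to "soo" and therefore share a count.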


2020 ◽  
Vol 10 (2) ◽  
pp. 7-10
Author(s):  
Deepti Pandey

This article provides insight into an emerging research discipline called Psychoinformatics. In the context of Psychoinformatics, we emphasize the cooperation between the disciplines of Psychology and Information Science, which handles large data sets derived from heavily used devices such as smartphones, or from online social networking, in order to highlight psychological qualities including both personality and mood. New challenges await psychologists in view of the resulting “Big Data” sets, because classic psychological methods will only in part be able to analyze this data derived from ubiquitous mobile devices and other everyday technologies. Consequently, psychologists must enrich their scientific methods through the inclusion of methods from informatics. Furthermore, we emphasize that data derived from Psychoinformatics can be combined in a meaningful way with data from human neuroscience. We close the article with some observations on areas for future research and problems that require consideration within this new discipline.


2018 ◽  
Vol 26 (3) ◽  
pp. 499-530 ◽  
Author(s):  
Valentina Ndou ◽  
Giustina Secundo ◽  
John Dumay ◽  
Elvin Gjevori

Purpose: Intellectual capital disclosure (ICD) in universities is gaining increasing attention, especially through the adoption of innovative technologies. Online media, as a relevant source of Big Data, is shifting ICD. The purpose of this paper is to explore how Big Data generated through online media, such as websites and platforms like Facebook, can be used as rich sources of data and viable disclosure channels for ICD in a university.
Design/methodology/approach: This is an exploratory case study, following the methodology in Yin (2014), that examines how online media data contributes to closing the ICD gap. The IC disclosed through different online media channels by a private university in Albania is analysed using Secundo et al.’s (2016) collective intelligence framework. The online data sources include the university’s website, Facebook page, periodic reports and statements outlining future goals.
Findings: The authors find that IC is an important part of how universities operate, and that IC is communicated through social media, although unintentionally. This highlights the importance of IC: researchers who want to discover IC and understand how it works in an organisation need to include social media as a prime resource for developing that understanding.
Research limitations/implications: Most importantly, the findings add to a growing consensus that ICD researchers, and researchers in other management and accounting disciplines, who traditionally rely on annual corporate social responsibility and other periodic reports, need to change their medium of analysis, because these reports can no longer be relied on to understand IC and its impact on an organisation.
Originality/value: Online media tools and the advent of Big Data have created new opportunities for universities to disclose their IC information to stakeholders in a timely manner and to gain relevant insights into their impact on society. The originality of the paper resides in the contribution of Big Data to the ICD research stream.


2020 ◽  
pp. 23-112
Author(s):  
Dariusz Jemielniak

The chapter presents the idea of Thick Big Data, a methodological approach combining big data sets with thick, ethnographic analysis. It presents different quantitative methods, including Google Correlate, social network analysis (SNA), online polls, culturomics, and data scraping, as well as easy tools to start working with online data. It describes the key differences in performing qualitative studies online, focusing on the example of digital ethnography. It also offers guidance on using case studies of digital communities. It gives specific guidance on conducting interviews online, and describes how to perform narrative analysis of digital culture. It concludes by describing methods of studying online cultural production, and discusses the notions of remix culture, memes, and trolling.


2014 ◽  
Author(s):  
Pankaj K. Agarwal ◽  
Thomas Moelhave
Keyword(s):  
Big Data ◽  

2020 ◽  
Vol 13 (4) ◽  
pp. 790-797
Author(s):  
Gurjit Singh Bhathal ◽  
Amardeep Singh Dhiman

Background: On today's internet, large amounts of data are generated and processed. The Hadoop framework is widely used to store and process big data in a highly distributed manner. It is argued that the Hadoop framework is not mature enough to deal with current cyberattacks on the data. Objective: The main objective of the proposed work is to provide a complete security approach comprising authorisation and authentication for the user and the Hadoop cluster nodes, and to secure the data at rest as well as in transit. Methods: The proposed algorithm uses the Kerberos network authentication protocol for authorisation and authentication and to validate the users and the cluster nodes. Ciphertext-Policy Attribute-Based Encryption (CP-ABE) is used for data at rest and data in transit. Users encrypt a file with their own set of attributes and store it on the Hadoop Distributed File System. Only intended users can decrypt that file with matching parameters. Results: The proposed algorithm was implemented with data sets of different sizes. The data was processed with and without encryption. The results show little difference in processing time. Performance was affected in the range of 0.8% to 3.1%, which also includes the impact of other factors, such as system configuration, the number of parallel jobs running and the virtual environment. Conclusion: The solutions available for handling the big data security problems faced in the Hadoop framework are inefficient or incomplete. A complete security framework is proposed for the Hadoop environment. The solution is experimentally shown to have little effect on the performance of the system for data sets of different sizes.
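
The abstract reports the performance impact as a percentage but does not say how it was computed. One common reading, assumed here, is relative overhead against the unencrypted run. The function name and the timings below are hypothetical and are not results from the paper.

```python
def overhead_percent(baseline_seconds, secured_seconds):
    """Relative slowdown of the secured run versus the baseline, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (secured_seconds - baseline_seconds) / baseline_seconds * 100.0


# Hypothetical timings for one job, chosen only to illustrate the arithmetic:
# (203.0 - 200.0) / 200.0 * 100 is a 1.5 percent overhead.
example = overhead_percent(200.0, 203.0)
```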

