How Big Does Big Data Need to Be?

Author(s):  
Martin Stange ◽  
Burkhardt Funk

Collecting and storing as much data as possible is common practice in many companies these days. To reduce the cost of collecting and storing data that is not relevant, it is important to define which analytical questions are to be answered and how much data is needed to answer them. In this chapter, a process for defining an optimal sample size is proposed. Based on benefit/cost considerations, the authors show how to find the sample size that maximizes the utility of predictive analytics. By applying the proposed process to a case study, they show that only a very small fraction of the available data set is needed to make accurate predictions.
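A rough sketch of the benefit/cost idea behind the chapter (the authors' exact utility function is not given here, so the benefit and cost constants, the learning setup, and the candidate sizes below are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative assumptions: benefit grows with predictive accuracy,
# cost grows linearly with the number of rows collected and stored.
BENEFIT_PER_ACCURACY_POINT = 10_000.0   # value of a model with accuracy 1.0
COST_PER_ROW = 0.05                     # collection + storage cost per row

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

def utility(n):
    """Utility of predictive analytics on a sample of size n."""
    idx = np.random.RandomState(0).choice(len(X), size=n, replace=False)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    acc = cross_val_score(model, X[idx], y[idx], cv=3).mean()
    return BENEFIT_PER_ACCURACY_POINT * acc - COST_PER_ROW * n

# Evaluate candidate sample sizes and keep the one with the highest utility.
sizes = [250, 500, 1_000, 2_000, 5_000, 10_000, 20_000]
best = max(sizes, key=utility)
print(f"utility-maximizing sample size: {best}")
```

Because accuracy typically saturates while cost keeps growing linearly, the maximizer usually lands well below the full data set, which matches the chapter's case-study finding.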

Author(s):  
Sheik Abdullah A. ◽  
Priyadharshini P.

The term Big Data refers to large data sets that occur in many different forms. In recent years, most organizations have generated vast amounts of data in diverse forms, giving rise to the dimensions of volume, variety, velocity, and veracity. On the volume dimension, Big Data concerns data sets that exceed what a traditional database can store and process. Big Data spans structured, unstructured, and semi-structured data, and working with it draws on programming, data warehousing, computational frameworks, quantitative and statistical aptitude, and business knowledge. Among the analytics applied in the Big Data sector, predictive analytics and social media analytics are widely used to determine patterns or trends that are about to emerge. This chapter mainly deals with the tools and techniques of big data analytics across various applications.


Author(s):  
Nick Kelly ◽  
Maximiliano Montenegro ◽  
Carlos Gonzalez ◽  
Paula Clasing ◽  
Augusto Sandoval ◽  
...  

Purpose: The purpose of this paper is to demonstrate the utility of combining event-centred and variable-centred approaches when analysing big data for higher education institutions. It uses a large, university-wide data set to demonstrate the methodology for this analysis through a case study. It presents empirical findings about relationships between student behaviours in a learning management system (LMS) and student learning outcomes, and further explores these findings using process-modelling techniques.

Design/methodology/approach: The paper describes a two-year study in a Chilean university, using big data from an LMS and from the central university database of student results and demographics. Descriptive statistics of LMS use in different years present an overall picture of student use of the system. Process mining is described as an event-centred approach that gives a deeper level of understanding of these findings.

Findings: The study found evidence to support the idea that instructors do not strongly influence student use of an LMS. It replicates existing studies in showing that higher-performing students use an LMS differently from lower-performing students. It shows the value of combining variable- and event-centred approaches to learning analytics.

Research limitations/implications: The study is limited by its institutional context, its two-year time frame and its exploratory, case-study mode of investigation.

Practical implications: The paper is useful for institutions developing a methodology for using big data from an LMS with event-centred approaches.

Originality/value: The paper is valuable in replicating and extending recent studies that use event-centred approaches to the analysis of learning data. The study is on a larger scale than existing studies (using a university-wide data set) and in a novel context (Latin America), and it provides a clear account of how and why the methodology should inform institutional approaches.
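A minimal illustration of the two lenses on an LMS event log, using plain pandas rather than a dedicated process-mining library (the log, the action names and the grades are hypothetical):

```python
import pandas as pd

# Hypothetical LMS event log: one row per student action.
events = pd.DataFrame({
    "student": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "action":  ["login", "view_material", "submit_quiz",
                "login", "submit_quiz", "view_material"],
    "ts": pd.to_datetime(["2023-03-01 09:00", "2023-03-01 09:05",
                          "2023-03-01 09:30", "2023-03-01 10:00",
                          "2023-03-01 10:02", "2023-03-01 10:40"]),
})
grades = pd.Series({"s1": 6.5, "s2": 4.0}, name="grade")  # learning outcomes

# Variable-centred view: per-student counts of each action type,
# which can then be correlated with learning outcomes.
features = events.pivot_table(index="student", columns="action",
                              aggfunc="size", fill_value=0)
print(features.join(grades).corr(numeric_only=True)["grade"])

# Event-centred view: a directly-follows graph (the basic object of
# process mining) counting how often one action follows another.
ordered = events.sort_values(["student", "ts"])
ordered["next"] = ordered.groupby("student")["action"].shift(-1)
print(ordered.dropna(subset=["next"]).value_counts(["action", "next"]))
```

The point of combining the two views is that the counts answer "how much" while the directly-follows graph answers "in what order", and the paper's findings draw on both.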


2020 ◽  
Vol 34 (5) ◽  
pp. 845-858
Author(s):  
Johannes C. Eichstaedt ◽  
Aaron C. Weidman

Personality psychologists are increasingly documenting dynamic, within-person processes. Big data methodologies can augment this endeavour by allowing for the collection of naturalistic and personality-relevant digital traces from online environments. Whereas big data methods have primarily been used to catalogue static personality dimensions, here we present a case study in how they can be used to track dynamic fluctuations in psychological states. We apply a text-based, machine learning prediction model to Facebook status updates to compute weekly trajectories of emotional valence and arousal. We train this model on 2,895 human-annotated Facebook statuses and apply the resulting model to 303,575 Facebook statuses posted by 640 US Facebook users who had previously self-reported their Big Five traits, yielding an average of 28 weekly estimates per user. We examine the correlations between model-predicted emotion and self-reported personality, providing a test of the robustness of these links when using weekly aggregated data, rather than momentary data as in prior work. We further present dynamic visualizations of weekly valence and arousal for every user, while making the final data set of 17,937 weeks openly available. We discuss the strengths and drawbacks of this method in the context of personality psychology's evolution into a dynamic science. © 2020 European Association of Personality Psychology
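The authors' trained model is not reproduced here; the sketch below substitutes a generic TF-IDF plus ridge-regression predictor and toy data to show the pipeline shape: train on annotated statuses, predict per status, then aggregate to weekly trajectories per user.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy stand-in for the human-annotated training statuses (valence scores).
train_text = ["had a wonderful day with friends", "so tired and upset",
              "excited about the trip", "everything is going wrong"]
train_valence = [0.8, -0.6, 0.7, -0.8]

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(train_text, train_valence)

# Hypothetical user statuses; predict valence per status, then resample
# the per-user time series into weekly means.
posts = pd.DataFrame({
    "user": ["u1", "u1", "u1"],
    "text": ["great news today", "awful commute", "lovely weekend"],
    "ts": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-10"]),
})
posts["valence"] = model.predict(posts["text"])
weekly = posts.set_index("ts").groupby("user")["valence"].resample("W").mean()
print(weekly)
```

The weekly series produced this way is the kind of aggregate that would then be correlated with self-reported Big Five scores.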


2017 ◽  
Vol 3 (4) ◽  
pp. 250-259 ◽  
Author(s):  
Zichan Ruan ◽  
Yuantian Miao ◽  
Lei Pan ◽  
Nicholas Patterson ◽  
Jun Zhang
Keyword(s):  
Big Data

2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Wissam Nazeer Wassouf ◽  
Ramez Alkhatib ◽  
Kamal Salloum ◽  
Shadi Balloul

2018 ◽  
Vol 46 (3) ◽  
pp. 147-160 ◽  
Author(s):  
Laouni Djafri ◽  
Djamel Amar Bensaber ◽  
Reda Adjoudj

Purpose: This paper aims to address the volume, veracity and velocity challenges of big data analytics for prediction, improving the prediction result to an acceptable level in the shortest possible time.

Design/methodology/approach: The work is divided into two parts. The first aims to improve the prediction result and proposes two ideas: a double-pruning enhanced random forest algorithm, and the extraction of a shared learning base via stratified random sampling to obtain a learning base that is representative of all the original data. The second part proposes a distributed architecture, supported by new technology solutions, that works coherently and efficiently with the sampling strategy under the supervision of the MapReduce algorithm.

Findings: The representative learning base, obtained by integrating two learning bases (the partial base and the shared base), represents the original data set very well and yields very good big data predictive analytics results. These results were further supported by the improved random forest supervised learning method, which played a key role in this context.

Originality/value: All companies are concerned, especially those that hold large amounts of information and want to mine it to improve their customer knowledge and optimize their campaigns.
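The double-pruning random forest variant is specific to this paper, so the sketch below shows only the stratified-sampling step with a standard scikit-learn forest (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Large, imbalanced data standing in for the original data set.
X, y = make_classification(n_samples=100_000, weights=[0.9, 0.1],
                           random_state=0)

# Stratified random sampling: draw a small learning base whose class
# proportions match the full data, the "representative" property the
# paper relies on.
X_base, X_rest, y_base, y_rest = train_test_split(
    X, y, train_size=0.05, stratify=y, random_state=0)
assert abs(y_base.mean() - y.mean()) < 0.01  # class proportions preserved

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_base, y_base)
print(f"accuracy on held-out data: {clf.score(X_rest, y_rest):.3f}")
```

In the paper the sampling runs distributed under MapReduce; the single-machine version above only illustrates why a small stratified base can stand in for the whole.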


Author(s):  
Anuoluwapo Ajayi ◽  
Lukumon Oyedele ◽  
Juan Manuel Davila Delgado ◽  
Lukman Akanbi ◽  
Muhammad Bilal ◽  
...  

Purpose: The purpose of this paper is to highlight the use of big data technologies for health and safety risk analytics in the power infrastructure domain, where the large data sets of health and safety risks are usually sparse and noisy.

Design/methodology/approach: The study focuses on using big data frameworks to design a robust architecture for handling and analysing (via exploratory and predictive analytics) accidents in power infrastructure. The designed architecture is based on a coherent health risk analytics lifecycle. A prototype of the architecture, interfacing various technology artefacts, was implemented in Java to predict the likelihood of health hazard occurrence. A preliminary evaluation of the proposed architecture was carried out with a subset of objective data obtained from a leading UK power infrastructure company offering a broad range of power infrastructure services.

Findings: The proposed architecture was able to identify relevant variables and improve preliminary prediction accuracies and explanatory capacities. It also enabled conclusions to be drawn regarding the causes of health risks. The results represent a significant improvement in managing information on construction accidents, particularly in the power infrastructure domain.

Originality/value: This study carries out a comprehensive literature review to advance health and safety risk management in construction. It also highlights the inability of conventional technologies to handle unstructured and incomplete data sets for real-time analytics processing. The study proposes a big data technique for finding complex patterns and establishing the statistical cohesion of hidden patterns for optimal future decision making.
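The prototype described in the abstract was written in Java and its data are proprietary; the Python sketch below only illustrates the prediction step on sparse, noisy records: impute the missing fields, then estimate hazard likelihoods with a probabilistic classifier (all data synthetic).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for sparse, noisy accident records: rows are
# incidents, columns are site/task attributes, NaN marks missing fields.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X[rng.random(X.shape) < 0.3] = np.nan       # roughly 30% missing values
y = rng.integers(0, 2, size=500)            # 1 = health hazard occurred

# Imputation handles the incompleteness the abstract highlights before a
# probabilistic classifier estimates the likelihood of hazard occurrence.
model = make_pipeline(SimpleImputer(strategy="median"),
                      GradientBoostingClassifier(random_state=0))
model.fit(X, y)
print(model.predict_proba(X[:3])[:, 1])     # predicted hazard likelihoods
```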


2017 ◽  
Vol 8 (2) ◽  
pp. 515-524 ◽  
Author(s):  
Marko Bohanec ◽  
Mirjana Kljajić Borštnar ◽  
Marko Robnik-Šikonja ◽  
...  
