Data Preparation for Big Data Analytics

Author(s):  
Andreas Schmidt ◽  
Martin Atzmueller ◽  
Martin Hollender

This chapter provides an overview of methods for preprocessing structured and unstructured data in the scope of Big Data. Specifically, it summarizes these methods in the context of a real-world dataset from a petrochemical production setting and describes state-of-the-art methods for data preparation for Big Data Analytics. The chapter also discusses experiences and first insights from a real-world case study in a specific project setting. Finally, interesting directions for future research are outlined.

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Marwa Rabe Mohamed Elkmash ◽  
Magdy Gamal Abdel-Kader ◽  
Bassant Badr El Din

Purpose This study aims to investigate the impact of big data analytics (BDA) as a mechanism that could develop the ability to measure customers’ performance. To accomplish this aim, the theoretical discussion combines the diffusion of innovation theory with the technology acceptance model (TAM), which is less developed in the research field of this study. Design/methodology/approach Empirical data were obtained using Web-based quasi-experiments with 104 Egyptian accounting professionals. The Wilcoxon signed-rank test and the chi-square goodness-of-fit test were used to analyze the data. Findings The empirical results indicate that measuring customers’ performance based on BDA increases the organizations’ ability to analyze customers’ unstructured data, decreases the cost of analyzing that data, increases the ability to handle customers’ problems quickly, minimizes the time spent analyzing customers’ data and obtaining customers’ performance reports, and controls managers’ bias when they measure customer satisfaction. The findings support the accounting professionals’ acceptance of BDA through the TAM elements: intention to use (R), perceived usefulness (U) and perceived ease of use (E). Research limitations/implications This study has several limitations that could be addressed in future research. First, it focuses on customers’ performance measurement (CPM) only and ignores other performance measurements such as employees’ performance measurement and financial performance measurement. Future research can examine these areas. Second, this study conducts a Web-based experiment with Master of Business Administration students as participants; researchers could conduct a laboratory experiment and report whether there are differences. Third, owing to the novelty of the topic, there was a lack of theoretical evidence for developing the study’s hypotheses. Practical implications This study provides much-needed empirical evidence of BDA’s positive impact on CPM efficiency through the proposed framework (i.e. the CPM and BDA framework). Furthermore, it contributes to improving the performance measurement process, and thus the decision-making process, with meaningful and proper insights through the capability of collecting and analyzing customers’ unstructured data. On a practical level, a company could eventually use this study’s results and new insights to make better decisions and develop its policies. Originality/value This study is significant because it provides much-needed empirical evidence of BDA’s positive impact on CPM efficiency. The findings will contribute to enhancing the performance measurement process through the ability to gather and analyze customers’ unstructured data.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Kiran Adnan ◽  
Rehan Akbar

Abstract The process of information extraction (IE) is used to extract useful information from unstructured or semi-structured data. Big data raises new challenges for IE techniques with the rapid growth of multifaceted, also called multidimensional, unstructured data. Traditional IE systems cannot deal efficiently with this deluge of unstructured big data. The volume and variety of big data demand improved computational capabilities from these IE systems. It is necessary to understand the competency and limitations of the existing IE techniques related to data pre-processing, data extraction and transformation, and representations for huge volumes of multidimensional unstructured data. Numerous studies have been conducted on IE, addressing the challenges and issues for different data types such as text, image, audio and video. Very little consolidated research has been conducted to investigate the task-dependent and task-independent limitations of IE covering all data types in a single study. This research addresses this limitation and presents a systematic literature review of state-of-the-art techniques for a variety of big data, consolidating all data types. Recent challenges of IE are also identified and summarized. Potential solutions are proposed, giving future research directions in big data IE. The research is significant in terms of recent trends and challenges related to big data analytics. The outcome of the research and its recommendations will help to improve big data analytics by making it more productive.


Author(s):  
Shungang Ning ◽  
Jianzhong Sun ◽  
Cui Liu ◽  
Yang Yi

Big data analytics with deep learning approaches has attracted increasing attention in transportation engineering, involving operations, maintenance, and safety. In the commercial aviation sector, the operational and maintenance data produced on modern aircraft are increasing exponentially, and predictive analysis of these data is an exciting and promising field in aviation maintenance, with the potential to revolutionize the aerospace maintenance industry. This study illustrates the state-of-the-art applications of deep learning in big data analytics for predictive maintenance and presents a real-world case study for commercial aircraft. A Long Short-Term Memory network-based auto-encoder (LSTM-AE) is proposed for complex aircraft system fault detection and classification, which makes use of the raw time-series data from heterogeneous sensors. The proposed method uses nominal time-series samples corresponding to healthy behavior of the system to learn a reconstruction model based on the LSTM-AE framework. The system health index (HI) and fault feature vectors are then derived from the reconstruction error matrix for fault detection and classification. The proposed method is demonstrated on a real-world data set from a commercial aircraft fleet. The typical PCV faults as well as the 390 F sensor and 450 F sensor faults due to sense line air leakage are successfully detected and distinguished based on the extracted features. The case study results show that the computed HI can effectively characterize the health state of the aircraft system and that different fault types can be identified with high confidence, which is helpful for line fault troubleshooting.
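The idea of deriving a health index and fault features from a reconstruction error matrix can be illustrated with a minimal sketch. This is not the authors' LSTM-AE: the reconstruction model below is a deliberately simple stand-in (per-sensor means of nominal data), and the function names, the squared-error measure, the convention that a larger HI means a larger deviation, the threshold rule and the toy data are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, List

Window = List[List[float]]  # rows are time steps, columns are sensors


def fit_mean_reconstructor(nominal: List[Window]) -> Callable[[Window], Window]:
    """Stand-in for a trained LSTM-AE: reconstructs every time step as the
    per-sensor mean observed in nominal (healthy) data."""
    rows = [row for window in nominal for row in window]
    means = [mean(column) for column in zip(*rows)]
    return lambda window: [list(means) for _ in window]


def reconstruction_error(window: Window, reconstruct: Callable[[Window], Window]) -> Window:
    """Squared error per time step and sensor (the reconstruction error matrix)."""
    recon = reconstruct(window)
    return [[(x - r) ** 2 for x, r in zip(row, recon_row)]
            for row, recon_row in zip(window, recon)]


def health_index(error: Window) -> float:
    """Assumed HI: mean error over the whole matrix (larger = less healthy)."""
    return mean(e for row in error for e in row)


def fault_features(error: Window) -> List[float]:
    """Assumed fault feature vector: mean error per sensor."""
    return [mean(column) for column in zip(*error)]


# Toy data: two sensors, two time steps per window (hypothetical values).
nominal_windows = [[[1.0, 2.0], [1.1, 1.9]], [[0.9, 2.1], [1.0, 2.0]]]
faulty_window = [[1.0, 3.0], [1.0, 3.1]]  # second sensor drifts upward

reconstruct = fit_mean_reconstructor(nominal_windows)
nominal_his = [health_index(reconstruction_error(w, reconstruct)) for w in nominal_windows]
threshold = 3 * max(nominal_his)  # assumed rule: margin over the worst nominal HI

faulty_error = reconstruction_error(faulty_window, reconstruct)
faulty_hi = health_index(faulty_error)
faulty_features = fault_features(faulty_error)

print("fault detected:", faulty_hi > threshold)
print("most affected sensor index:", faulty_features.index(max(faulty_features)))
```

In this sketch the HI flags that something is wrong, while the per-sensor features indicate where the deviation is concentrated, which mirrors the split between detection and classification described in the abstract.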


2019 ◽  
Vol 19 (1) ◽  
pp. 24-47 ◽  
Author(s):  
Matteo Golfarelli ◽  
Stefano Rizzi

In big data analytics, advanced analytic techniques operate on big datasets with the aim of complementing the role of traditional OLAP in decision making. To enable companies to benefit from these techniques despite a lack of in-house technical skills, the H2020 TOREADOR Project adopts a model-driven architecture for streamlining analysis processes, from data preparation to visualization. In this article, we propose a new approach named SkyViz focused on the visualization area, in particular on (1) how to specify the user’s objectives and describe the dataset to be visualized, (2) how to translate this specification into a platform-independent visualization type, and (3) how to concretely implement this visualization type on the target execution platform. To support step (1), we define a visualization context based on seven prioritizable coordinates for assessing the user’s objectives and conceptually describing the data to be visualized. To automate step (2), we propose a skyline-based technique that translates a visualization context into a set of most suitable visualization types. Finally, to automate step (3), we propose a skyline-based technique that, with reference to a specific platform, finds the best bindings between the columns of the dataset and the graphical coordinates used by the visualization type chosen by the user. SkyViz can be transparently extended to include more visualization types on the one hand and more visualization coordinates on the other. The article concludes with an evaluation of SkyViz based on a case study taken from the pilot applications of the TOREADOR Project.
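A skyline query keeps only the candidates that no other candidate dominates, that is, beats or ties on every criterion and strictly beats on at least one. The sketch below shows this general idea applied to choosing visualization types. It is not the SkyViz implementation: the three criteria, the scores and the chart names are hypothetical, and higher scores are assumed to be better.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every criterion and strictly
    better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of all candidates not dominated by any other candidate."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical suitability scores per visualization type:
# (fit to user goal, fit to data types, readability)
candidates = {
    "bar chart": (0.9, 0.8, 0.9),
    "pie chart": (0.6, 0.7, 0.8),
    "scatter plot": (0.7, 0.9, 0.6),
    "heat map": (0.5, 0.9, 0.95),
}

print(skyline(candidates))  # ['bar chart', 'scatter plot', 'heat map']
```

Here the pie chart is dropped because the bar chart scores higher on all three criteria, while the remaining types each win on at least one criterion and are left for the user to choose from.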


2020 ◽  
Vol 98 ◽  
pp. 68-78 ◽  
Author(s):  
Aseem Kinra ◽  
Samaneh Beheshti-Kashi ◽  
Rasmus Buch ◽  
Thomas Alexander Sick Nielsen ◽  
Francisco Pereira

2015 ◽  
Vol 2015 ◽  
pp. 1-16 ◽  
Author(s):  
Ashwin Belle ◽  
Raghuram Thiagarajan ◽  
S. M. Reza Soroushmehr ◽  
Fatemeh Navidi ◽  
Daniel A. Beard ◽  
...  

The rapidly expanding field of big data analytics has started to play a pivotal role in the evolution of healthcare practices and research. It has provided tools to accumulate, manage, analyze, and assimilate large volumes of disparate, structured, and unstructured data produced by current healthcare systems. Big data analytics has recently been applied to aid the process of care delivery and disease exploration. However, the adoption rate and research development in this space are still hindered by some fundamental problems inherent in the big data paradigm. In this paper, we discuss some of these major challenges with a focus on three upcoming and promising areas of medical research: image-, signal-, and genomics-based analytics. Recent research that targets the utilization of large volumes of medical data while combining multimodal data from disparate sources is discussed. Potential areas of research within this field that could have a meaningful impact on healthcare delivery are also examined.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Rajesh Kumar Singh ◽  
Saurabh Agrawal ◽  
Abhishek Sahu ◽  
Yigit Kazancoglu

Purpose The proposed article aims to explore the opportunities, challenges and possible outcomes of incorporating big data analytics (BDA) into the health-care sector. The purpose of this study is to find the research gaps in the literature and to investigate the scope for incorporating new strategies in the health-care sector to increase the efficiency of the system. Design/methodology/approach For a state-of-the-art review, a systematic literature review was carried out to identify research gaps in the field of healthcare using big data (BD) applications. A detailed research methodology including material collection, descriptive analysis and categorization is used to carry out the literature review. Findings BD analysis is rapidly being adopted in the health-care sector to utilize the valuable information available in BD. However, it poses certain challenges that need attention. The article identifies and explains these challenges thoroughly. Research limitations/implications The proposed study will provide useful guidance to health-care sector professionals for managing the health-care system. It will help academicians and physicians to evaluate, improve and benchmark health-care strategies through BDA in the health-care sector. One limitation of the study is that it is based on a literature review; more in-depth studies may be carried out to generalize the results. Originality/value There are certain effective tools available in the market today that are currently being used by both small and large businesses and corporations. One of them is BD, which may be very useful for the health-care sector. A comprehensive literature review is carried out for research papers published between 1974 and 2021.

