Efficient processing of complex XSD using Hive and Spark

PeerJ Computer Science ◽

10.7717/peerj-cs.652 ◽

2021 ◽

Vol 7 ◽

pp. e652

Author(s):

Diana Martinez-Mosquera ◽

Rosa Navarrete ◽

Sergio Luján-Mora

Keyword(s):

Big Data ◽

Performance Management ◽

Mobile Networks ◽

Real Life ◽

Real Data ◽

Xml Schema ◽

Apache Spark ◽

Data Sets ◽

Apache Hive

eXtensible Markup Language (XML) files are widely used in industry because of their flexibility in representing many kinds of data. Applications such as financial records, social networks, and mobile networks use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements, or large real-world files. A great number of these files are generated each day, and this has influenced the development of Big Data tools for their parsing and reporting, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques and evaluated the processing of XML files with Big Data systems. However, such works usually consider only the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three techniques for parsing XML files: cataloging, deserialization, and positional explode. For cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes. Based on these elements, the deserialization and positional explode are straightforwardly implemented. To demonstrate the validity of our proposal, we develop a case study by implementing a test environment to illustrate the methods using real data sets provided from performance management of two mobile network vendors. Our main results confirm the validity of the proposed method for different versions of Apache Hive and Apache Spark, report the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance in Apache Hive with that of Apache Spark. Another contribution is a case study in which a novel solution is proposed for data analysis in the performance management systems of mobile networks.
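The idea behind a positional explode can be illustrated with a minimal plain-Python sketch. This is not the authors' Hive or Spark implementation: the XML document, its element names (`measData`, `measInfo`, `value`), and the helper functions `posexplode` and `flatten` are all hypothetical, chosen only to show how nested arrays in an XML file might be flattened into rows that keep each element's position.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a root element with an attribute, an array of
# structures (measInfo), each holding an array of values.
SAMPLE = """
<measData vendor="A">
  <measInfo id="cell-1">
    <value>10</value>
    <value>20</value>
  </measInfo>
  <measInfo id="cell-2">
    <value>30</value>
  </measInfo>
</measData>
""".strip()


def posexplode(items):
    """Return (position, item) pairs, one per array element."""
    return list(enumerate(items))


def flatten(xml_text):
    """Flatten the nested arrays into one row per innermost value."""
    root = ET.fromstring(xml_text)  # deserialize the XML text into a tree
    rows = []
    for info_pos, info in posexplode(root.findall("measInfo")):
        for value_pos, value in posexplode(info.findall("value")):
            rows.append({
                "vendor": root.get("vendor"),  # attribute on the root
                "info_pos": info_pos,          # position in the outer array
                "info_id": info.get("id"),     # attribute on the structure
                "value_pos": value_pos,        # position in the inner array
                "value": int(value.text),      # leaf value
            })
    return rows


rows = flatten(SAMPLE)
for row in rows:
    print(row)
```

In this sketch the repeated `measInfo` and `value` elements play the role of arrays, `vendor` and `id` are attributes, and the element text is the value. Keeping the positions alongside the data is what lets rows from the same parent be matched up again after flattening.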


Data Science and Big Data Practice Using Apache Spark and Python

Advances in Data Mining and Database Management - Intelligent Analytics With Advanced Multi-Industry Applications ◽

10.4018/978-1-7998-4963-6.ch004 ◽

2021 ◽

pp. 67-95

Author(s):

Li Chen ◽

Lala Aicha Coulibaly

Keyword(s):

Information Technology ◽

Big Data ◽

Computer Science ◽

Data Analytics ◽

Data Science ◽

Principal Component ◽

Real Data ◽

Apache Spark ◽

Data Sets ◽

Information Technology Students

Data science and big data analytics are still at the center of computer science and information technology. Students and researchers outside computer science often find real data analytics difficult when using programming languages such as Python and Scala, especially when they attempt to use Apache Spark in cloud computing environments (Spark Scala and PySpark). At the same time, students in information technology may find it difficult to deal with the mathematical background of data science algorithms. To overcome these difficulties, this chapter provides a practical guideline for different users in this area. The authors cover the main algorithms for data science and machine learning, including principal component analysis (PCA), support vector machine (SVM), k-means, k-nearest neighbors (kNN), regression, neural networks, and decision trees. A brief description of these algorithms is given, and the related code is selected to fit simple data sets and real data sets. Some visualization methods, including 2D and 3D displays, are also presented in this chapter.


Development and Evaluation of a Big Data Framework for Performance Management in Mobile Networks

IEEE Access ◽

10.1109/access.2020.3045175 ◽

2020 ◽

Vol 8 ◽

pp. 226380-226396

Author(s):

Diana Martinez-Mosquera ◽

Rosa Navarrete ◽

Sergio Lujan-Mora

Keyword(s):

Big Data ◽

Performance Management ◽

Mobile Networks ◽

Data Framework


Public Administration Curriculum-Based Big Data Policy-Analytic Epistemology

Advances in Data Mining and Database Management - Handbook of Research on Big Data and the IoT ◽

10.4018/978-1-5225-7432-3.ch024 ◽

2019 ◽

pp. 467-488

Author(s):

Emmanuel N. A. Tetteh

Keyword(s):

Big Data ◽

Action Learning ◽

Real Life ◽

Big Data Analytics ◽

Social Challenges ◽

Political Policy ◽

New Information ◽

Data Policy ◽

The Internet Of Things

The equilibration that underscores the internet of things (IoT) and big data analytics (BDA) cannot be underestimated at the behest of real-life social challenges and significant policy data generated to redress the concerns of epistemic communities, such as political policy actors, stakeholders, and the citizenry. The cognitive balancing of new information gathered by BDA and assimilated across the IoT is at the crossroads of ascertaining how the growing increases of such BDA can be better managed to transition from the big data state of disequilibration to reach a more stable equilibrium of policy data usefulness. In the quest for explicating the equilibration of policy data usefulness, an account of the curriculum-based MPA policy analysis and analytics concentration program at Norwich University is described as a case example of big data policy-analytic epistemology. The case study offers a symbolic ideology of an IoT action-learning solution model as a recommendation for fostering the stable equilibration of policy data usefulness.


A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

Applied Computer Systems ◽

10.2478/acss-2019-0013 ◽

2019 ◽

Vol 24 (2) ◽

pp. 104-110

Author(s):

Duygu Sinanc Terzi ◽

Seref Sagiroglu

Keyword(s):

Big Data ◽

Class Imbalance ◽

Area Under The Curve ◽

Data Sets ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

The Common ◽

Public Datasets ◽

Distributed Cluster

The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase the classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy has been designed to present the success of the model on data sets with different imbalanced ratios. The second strategy has been designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values between 10% and 24% through the case study.


A business intelligence framework for Sultan Qaboos University: A case study in the Middle East

Journal of Intelligence Studies in Business ◽

10.37380/jisib.v7i3.278 ◽

2017 ◽

Vol 7 (3) ◽

Author(s):

Saud Sultan Al Rashdi ◽

Smitha Sunil Kumaran Nair

Keyword(s):

Big Data ◽

Performance Management ◽

Business Intelligence ◽

Teaching And Learning ◽

Successful Implementation ◽

Sultan Qaboos University ◽

Making Sense ◽

Usable Information ◽

The University

Higher education institutions generate big data, yet they are not exploited to obtain usable information. Making sense of data within organizations becomes the key factor for success in maintaining sustainability within the market and gaining competitive advantages. Business intelligence and analytics addresses the challenges of data visibility and data integrity that helps to shift the big data to provide deep insights into such data. This research aims to build a customized business intelligence (BI) framework for Sultan Qaboos University (SQU). The research starts with assessing the BI maturity of the educational institutions prior to implementation followed by developing a BI prototype to test BI capabilities of performance management in SQU. The prototype has been tested for the key business activity (KBA): teaching and learning at one college of the university. The results show that the aggregation of the different KBAs and KPIs will contribute to the overall SQU performance and will provide better visibility of how SQU as an organization is functioning, which is the key towards the successful implementation of BI within SQU in the future.


Overview of Big Data and Its Visualization

10.4018/978-1-6684-3662-2.ch002 ◽

2022 ◽

pp. 22-53

Author(s):

Richard S. Segall ◽

Gao Niu

Keyword(s):

Big Data ◽

Traffic Safety ◽

Data Analytics ◽

Big Data Analytics ◽

Real Data ◽

Flow Diagram ◽

Data Sets ◽

United States Department ◽

Big Data Visualization ◽

Challenges And Opportunities

Big Data refers to data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them. This chapter discusses what Big Data is and its characteristics, how this information revolution of Big Data is transforming our lives, and the new technology and methodologies that have been developed to process data of these huge dimensionalities. This chapter discusses the components of the Big Data stack interface, categories of Big Data analytics software and platforms, and descriptions of the top 20 Big Data analytics software. Big Data visualization techniques are discussed with real data from the Fatality Analysis Reporting System (FARS) managed by the National Highway Traffic Safety Administration (NHTSA) of the United States Department of Transportation. Big Data web-based visualization software, both JavaScript-based and user-interface-based, is also discussed. This chapter also discusses the challenges and opportunities of using Big Data and presents a flow diagram of the 30 chapters within this handbook.


Public Administration Curriculum-Based Big Data Policy-Analytic Epistemology

10.4018/978-1-6684-3662-2.ch063 ◽

2022 ◽

pp. 1307-1328

Author(s):

Emmanuel N. A. Tetteh

Keyword(s):

Big Data ◽

Action Learning ◽

Real Life ◽

Big Data Analytics ◽

Social Challenges ◽

Political Policy ◽

New Information ◽

Data Policy ◽

The Internet Of Things


Towards Flexible Retrieval, Integration and Analysis of JSON Data Sets through Fuzzy Sets: A Case Study

Information ◽

10.3390/info12070258 ◽

2021 ◽

Vol 12 (7) ◽

pp. 258

Author(s):

Paolo Fosci ◽

Giuseppe Psaila

Keyword(s):

Fuzzy Sets ◽

Query Language ◽

Traditional Approach ◽

Open Data ◽

Real Data ◽

Data Sets ◽

Practical Case ◽

Innovative Capabilities ◽

Potential Applications

How to exploit the incredible variety of JSON data sets currently available on the Internet, for example, on Open Data portals? The traditional approach would require getting them from the portals, then storing them into some JSON document store and integrating them within the document store. However, once data are integrated, the lack of a query language that provides flexible querying capabilities could prevent analysts from successfully completing their analysis. In this paper, we show how the J-CO Framework, a novel framework that we developed at the University of Bergamo (Italy) to manage large collections of JSON documents, is a unique and innovative tool that provides analysts with querying capabilities based on fuzzy sets over JSON data sets. Its query language, called J-CO-QL, is continuously evolving to increase potential applications; the most recent extensions give analysts the capability to retrieve data sets directly from web portals as well as constructs to apply fuzzy set theory to JSON documents and to provide analysts with the capability to perform imprecise queries on documents by means of flexible soft conditions. This paper presents a practical case study in which real data sets are retrieved, integrated and analyzed to effectively show the unique and innovative capabilities of the J-CO Framework.


A Novel Generator of Continuous Probability Distributions for the Asymmetric Left-skewed Bimodal Real-life Data with Properties and Copulas

Pakistan Journal of Statistics and Operation Research ◽

10.18187/pjsor.v17i4.3903 ◽

2021 ◽

pp. 943-961 ◽

Cited By ~ 1

Author(s):

Wahid A. M. Shehata ◽

Haitham Yousof ◽

Mohamed Aboraya

Keyword(s):

Probability Distributions ◽

Real Life ◽

Real Data ◽

Moment Generating Function ◽

Data Sets ◽

Base Line ◽

Survival Times ◽

New Family ◽

Real Life Data ◽

Two Parameter

This paper presents a novel two-parameter G family of distributions. Relevant statistical properties such as the ordinary moments, incomplete moments and moment generating function are derived. Using common copulas, some new bivariate type G families are derived. Special attention is devoted to the standard exponential base line model. The density of the new exponential extension can be “asymmetric and right skewed shape” with no peak, “asymmetric right skewed shape” with one peak, “symmetric shape” and “asymmetric left skewed shape” with one peak. The hazard rate of the new exponential distribution can be “increasing”, “U-shape”, “decreasing” and “J-shape”. The usefulness and flexibility of the new family are illustrated by means of two applications to real data sets. The new family is compared with many common G families in modeling relief times and survival times data sets.


Paying Attention to the Trees in the Forest, or a Call to Examine Agency-Specific Stories

Review of Public Personnel Administration ◽

10.1177/0734371x17753865 ◽

2018 ◽

Vol 39 (4) ◽

pp. 523-543 ◽

Cited By ~ 4

Author(s):

Ellen V. Rubin ◽

Keith P. Baker

Keyword(s):

Performance Management ◽

Case Studies ◽

Quantitative Research ◽

Case Study Research ◽

Large Data ◽

Diversity Management ◽

Data Sets ◽

Study Research ◽

Qualitative And Quantitative

Public administration scholarship needs to strike a better balance between large sample studies and in-depth case studies. The availability of large data sets has led us to engage in empirical research that is broad in scope but is frequently devoid of rich context. In-depth case studies can help to explain why we observe particular relationships and can help us to clarify gaps and inconsistencies in theory. Our argument for more case studies aims to encourage researchers to bridge insights from qualitative and quantitative research through triangulation. We describe the value of case study research, and qualitative and quantitative design options. We then propose opportunities for case study research in public personnel scholarship on patronage pressures, performance management, and diversity management.
