Big Data Management, Technologies, and Applications (Advances in Data Mining and Database Management)
Published by IGI Global
ISBN: 9781466646995, 9781466647008
Total documents: 17 (five years: 0)
H-index: 4 (five years: 0)

Latest Publications

Author(s): Joaquin Vanschoren, Ugo Vespier, Shengfa Miao, Marvin Meeng, Ricardo Cachucho, et al.

Sensors are increasingly being used to monitor the world around us. They measure movements of structures such as bridges, windmills, and plane wings, humans' vital signs, atmospheric conditions, and fluctuations in power and water networks. In many cases, this results in large networks with different types of sensors, generating impressive amounts of data. As the volume and complexity of the data increase, their effective use becomes more challenging, and novel solutions are needed on both a technical and a scientific level. Grounded in several real-world applications, this chapter discusses the challenges involved in large-scale sensor data analysis and describes practical solutions to address them. Due to the sheer size of the data and the large amount of computation involved, these are clearly “Big Data” applications.


Author(s): Jeonghyun Kim

The goal of this chapter is to explore the practice of big data sharing among academics and the issues related to that sharing. The first part of the chapter reviews the literature on big data sharing practices using current technology. The second part presents case studies of disciplinary data repositories in terms of their requirements and policies, describing and comparing those requirements and policies at repositories in three areas: Dryad for the life sciences, the Interuniversity Consortium for Political and Social Research (ICPSR) for the social sciences, and the National Oceanographic Data Center (NODC) for the physical sciences.


Author(s): M. Asif Naeem, Gillian Dobbie, Gerald Weber

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation, where a stream of updates, which is huge in volume and potentially unbounded, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to join stream data with disk-based master data using limited memory, which makes it a candidate for a resource-aware system setup. The problem the authors consider in this chapter is that MESHJOIN is not very selective: the performance of the algorithm is always inversely proportional to the size of the master data table, so its resource consumption is suboptimal in some scenarios. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN and performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. To quantify the performance differences, the authors compare both algorithms on a synthetic dataset with a known skewed distribution as well as on TPC-H and real-life datasets.
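
To make the join pattern concrete, the following minimal Python sketch (an illustrative simplification, not the authors' CACHEJOIN implementation) joins a stream of updates with disk-based master data while keeping frequently accessed master records in a bounded in-memory cache; the table name, key field, and cache size are assumptions.

```python
import sqlite3
from collections import OrderedDict

CACHE_SIZE = 10_000  # hypothetical upper bound on cached master records

def cached_stream_join(stream, db_path="master.db"):
    """Join a stream of (key, payload) updates with disk-based master data.

    Frequently used master records stay in an in-memory cache so that hot
    keys avoid a disk lookup; cold keys fall back to the master table.
    """
    conn = sqlite3.connect(db_path)
    cache = OrderedDict()  # key -> master row, maintained in LRU order

    for key, payload in stream:
        if key in cache:
            master_row = cache[key]
            cache.move_to_end(key)                 # mark as recently used
        else:
            cur = conn.execute("SELECT * FROM master WHERE key = ?", (key,))
            master_row = cur.fetchone()
            if master_row is not None:
                cache[key] = master_row
                if len(cache) > CACHE_SIZE:        # evict the coldest entry
                    cache.popitem(last=False)
        if master_row is not None:
            yield key, payload, master_row          # the joined tuple
```

Under a skewed stream, most lookups hit the cache, which is the effect a cache-enhanced join exploits when parts of the master data are used with different frequencies.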


Author(s): Ahmet Artu Yıldırım, Cem Özdoğan, Dan Watson

Data reduction is perhaps the most critical component in retrieving information from big data (i.e., petascale-sized data) in many data-mining processes. The central aim of these data reduction techniques is to save time and bandwidth, enabling users to deal with larger datasets even in minimal-resource environments such as desktop or small cluster systems. In this chapter, the authors examine why these reduction techniques are important in the analysis of big datasets. They then present several basic reduction techniques in detail, stressing the advantages and disadvantages of each. The authors also consider signal processing techniques for mining big data based on the discrete wavelet transform, as well as server-side data reduction techniques. Lastly, they include a general discussion of parallel algorithms for data reduction, with special emphasis on parallel wavelet-based multi-resolution data reduction techniques on distributed memory systems using MPI and on shared memory architectures using GPUs, along with a demonstration of the improvement in performance and scalability for one case study.
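
As a concrete illustration of wavelet-based reduction, the sketch below (assuming the PyWavelets package; the signal, wavelet choice, and retention ratio are arbitrary) zeroes the smallest coefficients of a discrete wavelet transform and reconstructs a compact approximation of the original series.

```python
import numpy as np
import pywt  # PyWavelets, assumed to be installed

def wavelet_reduce(signal, wavelet="haar", level=4, keep_ratio=0.05):
    """Reduce a 1-D signal by keeping only the largest wavelet coefficients.

    keep_ratio is the fraction of coefficients retained; the rest are set
    to zero before reconstruction, trading accuracy for storage.
    """
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    flat = np.concatenate([c.ravel() for c in coeffs])
    threshold = np.quantile(np.abs(flat), 1.0 - keep_ratio)
    reduced = [pywt.threshold(c, threshold, mode="hard") for c in coeffs]
    return pywt.waverec(reduced, wavelet)

# Example: a noisy sine wave approximated from roughly 5% of its coefficients
t = np.linspace(0, 1, 4096)
x = np.sin(2 * np.pi * 8 * t) + 0.1 * np.random.randn(t.size)
approx = wavelet_reduce(x)
```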


Author(s): Chris A. Mattmann, Andrew Hart, Luca Cinquini, Joseph Lazio, Shakeh Khudikyan, et al.

Big data as a paradigm focuses on data volume, velocity, and the number and complexity of data formats and metadata (the information that describes other data). This is nowhere better seen than in the development of the software to support next-generation astronomical instruments, including the MeerKAT/KAT-7 Square Kilometre Array (SKA) precursor in South Africa; the Low Frequency Array (LOFAR) in Europe; two instruments led in part by the U.S. National Radio Astronomy Observatory (NRAO), the Expanded Very Large Array (EVLA) in Socorro, NM, and the Atacama Large Millimeter Array (ALMA) in Chile; and other instruments such as the Large Synoptic Survey Telescope (LSST) to be built in northern Chile. This chapter highlights the big data challenges in constructing data management systems for these astronomical instruments, specifically the challenges of integrating legacy science codes, handling data movement and triage, building flexible science data portals and user interfaces, allowing for flexible technology deployment scenarios, and automatically and rapidly mitigating differences in science data formats and metadata models. The authors discuss these challenges and then suggest open source solutions to them based on software from the Apache Software Foundation, including Apache Object-Oriented Data Technology (OODT), Tika, and Solr. The authors have leveraged these solutions to effectively and expeditiously build many precursor and operational software systems that handle data from these astronomical instruments and to prepare for the coming data deluge from those not yet constructed. Their solutions are not specific to the astronomical domain and are already applicable to a number of science domains, including Earth science, planetary science, and biomedicine.
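
A minimal sketch of the kind of pipeline described here, assuming the tika-python bindings, the pysolr client, and a local Solr core named "observations" (the core name and file path are invented), extracts metadata from a data file with Apache Tika and indexes it in Apache Solr.

```python
import pysolr               # assumed: pysolr client for Apache Solr
from tika import parser     # assumed: tika-python bindings for Apache Tika

# Hypothetical local Solr core for instrument data products
solr = pysolr.Solr("http://localhost:8983/solr/observations", timeout=30)

def index_data_product(path):
    """Extract metadata and text from a science data file and index it in Solr."""
    parsed = parser.from_file(path)            # {'metadata': ..., 'content': ...}
    metadata = parsed.get("metadata") or {}
    doc = {
        "id": path,
        "content_type": metadata.get("Content-Type", "unknown"),
        "text": (parsed.get("content") or "")[:10000],  # truncate large payloads
    }
    solr.add([doc])

index_data_product("/data/products/example.fits")  # hypothetical file path
```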


Author(s): Mian Lu, Qiong Luo

Large-scale Genome-Wide Association Studies (GWAS) are a Big Data application due to the large amount of data to process and the high computational intensity. Furthermore, numerical issues (e.g., floating point underflow) limit the data scale in some applications. Graphics Processing Units (GPUs) have been used to accelerate genomic data analytics, such as sequence alignment, Single-Nucleotide Polymorphism (SNP) detection, and Minor Allele Frequency (MAF) computation. As MAF computation is the most time-consuming task in GWAS, the authors discuss in detail their techniques for accelerating this task on the GPU. They first present a reduction-based algorithm that better matches the GPU's data-parallel architecture than the original algorithm implemented in the CPU-based tool. They then implement this algorithm efficiently on the GPU by carefully optimizing local memory utilization and avoiding user-level synchronization. As the MAF computation suffers from floating point underflow, the authors transform the computation to logarithm space. In addition to the MAF computation, they briefly introduce GPU-accelerated sequence alignment and SNP detection. The experimental results show that the GPU-based GWAS implementations can accelerate state-of-the-art CPU-based tools by up to an order of magnitude.
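
The underflow issue mentioned above can be seen in a small sketch (illustrative only, not the authors' GPU kernel): multiplying many small per-site likelihoods underflows in double precision, whereas summing their logarithms stays numerically stable.

```python
import numpy as np

def combine_likelihoods(per_site_probs):
    """Combine many small probabilities directly and in log space.

    A direct product of 100,000 values around 1e-3 underflows to 0.0 in
    double precision; summing logarithms keeps the computation usable,
    mirroring the log-space transformation applied to the MAF statistic.
    """
    naive = np.prod(per_site_probs)           # underflows to 0.0
    stable = np.sum(np.log(per_site_probs))   # log-space equivalent
    return naive, stable

probs = np.full(100_000, 1e-3)                # hypothetical per-site likelihoods
naive, stable = combine_likelihoods(probs)
print(naive)    # 0.0 due to underflow
print(stable)   # about -690775.5, the usable log-likelihood
```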


Author(s): Stacy T. Kowalczyk, Yiming Sun, Zong Peng, Beth Plale, Aaron Todd, et al.

Big Data in the humanities is a new phenomenon that is expected to revolutionize the process of humanities research. The HathiTrust Research Center (HTRC) is a cyberinfrastructure to support humanities research on big humanities data. The HTRC has been designed to make the technology serve the researcher: to make content easy to find, to make the research tools efficient and effective, and to allow researchers to customize their environment, combine their own data with that of the HTRC, and contribute tools. The architecture has multiple layers of abstraction, providing a secure, scalable, extendable, and generalizable interface for both human and computational users.


Author(s): Lynne M. Webb, Yuanxin Wang

The chapter reviews traditional sampling techniques and suggests adaptations relevant to big data studies of text downloaded from online media such as email messages, online gaming, blogs, micro-blogs (e.g., Twitter), and social networking websites (e.g., Facebook). The authors review methods of probability, purposeful, and adaptive sampling of online data. They illustrate the use of these sampling techniques via published studies that report analysis of online text.
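
For illustration, the sketch below (hypothetical data, not drawn from the chapter) applies two of the probability sampling designs the authors review, simple random sampling and systematic sampling, to a downloaded corpus of posts.

```python
import random

posts = [f"post_{i}" for i in range(100_000)]   # hypothetical downloaded corpus

def simple_random_sample(items, n, seed=42):
    """Simple random sample: every item has an equal chance of selection."""
    return random.Random(seed).sample(items, n)

def systematic_sample(items, n, seed=42):
    """Systematic sample: a random start, then every k-th item thereafter."""
    k = len(items) // n
    start = random.Random(seed).randrange(k)
    return items[start::k][:n]

sample_a = simple_random_sample(posts, 1000)
sample_b = systematic_sample(posts, 1000)
```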


Author(s): Gueyoung Jung, Tridib Mukherjee

In the modern information era, the amount of data has exploded, and current trends indicate that it will continue to grow exponentially. This prevalent, humongous amount of data, referred to as big data, has given rise to the problem of finding the “needle in the haystack” (i.e., extracting meaningful information from big data). Many researchers and practitioners are focusing on big data analytics to address the problem. One of the major issues in this regard is the computational requirement of big data analytics. In recent years, the proliferation of loosely coupled distributed computing infrastructures (e.g., modern public, private, and hybrid clouds, high performance computing clusters, and grids) has made high computing capability available for large-scale computation. This has allowed the execution of big data analytics to gather pace across organizations and enterprises. However, even with this high computing capability, it remains a big challenge to efficiently extract valuable information from such vast data. Hence, unprecedented scalability of performance is required for the execution of big data analytics. A key question in this regard is how to maximally leverage the high computing capabilities of the aforementioned loosely coupled distributed infrastructures to ensure fast and accurate execution of big data analytics. To that end, this chapter focuses on synchronous parallelization of big data analytics over a distributed system environment to optimize performance.
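
As a rough sketch of the synchronous pattern (not the authors' system; the workload and partitioning are invented), the data are split across workers, each partition is analyzed in parallel, and a synchronization barrier merges the partial results before the next stage.

```python
from multiprocessing import Pool

def analyze_partition(partition):
    """Stand-in analytic task: partial sum and count for one data partition."""
    return sum(partition), len(partition)

def synchronous_parallel_mean(data, workers=8):
    """Split the data, process partitions in parallel, then synchronize.

    Pool.map blocks until every worker finishes, acting as the
    synchronization barrier before the partial results are merged.
    """
    size = max(1, len(data) // workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partials = pool.map(analyze_partition, partitions)   # barrier here
    total, count = (sum(values) for values in zip(*partials))
    return total / count

if __name__ == "__main__":
    print(synchronous_parallel_mean(list(range(1_000_000))))  # 499999.5
```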


Author(s): Kapil Bakshi

This chapter provides a review and analysis of several key Big Data technologies. Many Big Data technologies are currently in development and implementation, so a comprehensive review of all of them is beyond the scope of this chapter; the focus here is on the most widely adopted ones. The key Big Data technologies discussed are Map-Reduce, NoSQL, MPP (Massively Parallel Processing), and In-Memory Database technologies. For each of these, the following subtopics are covered: the history and genesis of the technology, the problem set it solves for Big Data analytics, and the details of the technology, including its components, technical architecture, and theory of operations. This is followed by technical operation and infrastructure (compute, storage, and network), design considerations, and performance benchmarks. Finally, the chapter provides an integrated approach to the above-mentioned Big Data technologies.
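
As a concrete reference point for the first of these technologies, the sketch below (a toy, single-process illustration rather than a distributed deployment) expresses word counting in the Map-Reduce style of map, shuffle, and reduce phases.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

documents = ["big data technologies", "big data analytics", "in memory databases"]
intermediate = chain.from_iterable(map_phase(d) for d in documents)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(intermediate).items())
# {'big': 2, 'data': 2, 'technologies': 1, 'analytics': 1, 'in': 1, 'memory': 1, 'databases': 1}
```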

