Big Data Management, Technologies, and Applications (Advances in Data Mining and Database Management)
Published by IGI Global
ISBN: 9781466646995, 9781466647008
Total documents: 17 (five years: 0)
H-index: 4 (five years: 0)

Latest Publications

Author(s): Joaquin Vanschoren, Ugo Vespier, Shengfa Miao, Marvin Meeng, Ricardo Cachucho, et al.

Sensors are increasingly being used to monitor the world around us. They measure movements of structures such as bridges, windmills, and plane wings, humans' vital signs, atmospheric conditions, and fluctuations in power and water networks. In many cases, this results in large networks with different types of sensors, generating impressive amounts of data. As the volume and complexity of the data increase, their effective use becomes more challenging, and novel solutions are needed on both a technical and a scientific level. Grounded in several real-world applications, this chapter discusses the challenges involved in large-scale sensor data analysis and describes practical solutions to address them. Due to the sheer size of the data and the large amount of computation involved, these are clearly “Big Data” applications.


Author(s): Jeonghyun Kim

The goal of this chapter is to explore the practice of big data sharing among academics and the issues related to that sharing. The first part of the chapter reviews the literature on big data sharing practices using current technology. The second part presents case studies of disciplinary data repositories in terms of their requirements and policies, describing and comparing those requirements and policies at repositories in three areas: Dryad for the life sciences, the Interuniversity Consortium for Political and Social Research (ICPSR) for the social sciences, and the National Oceanographic Data Center (NODC) for the physical sciences.


Author(s): M. Asif Naeem, Gillian Dobbie, Gerald Weber

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation, where a stream of updates, which is huge in volume and potentially unbounded, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to join stream data with disk-based master data using limited memory, which makes it a candidate for a resource-aware system setup. The problem the authors consider in this chapter is that MESHJOIN is not very selective: the performance of the algorithm is always inversely proportional to the size of the master data table, so its resource consumption is suboptimal in some scenarios. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN and performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. To quantify the performance differences, the authors compare both algorithms on a synthetic dataset with a known skewed distribution as well as on TPC-H and real-life datasets.
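
To make the join pattern concrete, the following minimal Python sketch (an illustrative simplification, not the authors' CACHEJOIN implementation) joins a stream of updates with disk-based master data while keeping frequently accessed master records in a bounded in-memory cache; the table name, key field, and cache size are assumptions.

```python
import sqlite3
from collections import OrderedDict

CACHE_SIZE = 10_000  # hypothetical upper bound on cached master records

def cached_stream_join(stream, db_path="master.db"):
    """Join a stream of (key, payload) updates with disk-based master data.

    Frequently used master records stay in an in-memory cache so that hot
    keys avoid a disk lookup; cold keys fall back to the master table.
    """
    conn = sqlite3.connect(db_path)
    cache = OrderedDict()  # key -> master row, maintained in LRU order

    for key, payload in stream:
        if key in cache:
            master_row = cache[key]
            cache.move_to_end(key)                 # mark as recently used
        else:
            cur = conn.execute("SELECT * FROM master WHERE key = ?", (key,))
            master_row = cur.fetchone()
            if master_row is not None:
                cache[key] = master_row
                if len(cache) > CACHE_SIZE:        # evict the coldest entry
                    cache.popitem(last=False)
        if master_row is not None:
            yield key, payload, master_row          # the joined tuple
```

Under a skewed stream, most lookups hit the cache, which is the effect a cache-enhanced join exploits when parts of the master data are used with different frequencies.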


Author(s): Ahmet Artu Yıldırım, Cem Özdoğan, Dan Watson

Data reduction is perhaps the most critical component in retrieving information from big data (i.e., petascale-sized data) in many data-mining processes. The central aim of these data reduction techniques is to save time and bandwidth, enabling users to deal with larger datasets even in minimal-resource environments such as desktop or small cluster systems. In this chapter, the authors examine why these reduction techniques are important in the analysis of big datasets. They then present several basic reduction techniques in detail, stressing the advantages and disadvantages of each. The authors also consider signal processing techniques for mining big data based on the discrete wavelet transform, as well as server-side data reduction techniques. Lastly, they include a general discussion of parallel algorithms for data reduction, with special emphasis on parallel wavelet-based multi-resolution data reduction techniques on distributed memory systems using MPI and on shared memory architectures using GPUs, along with a demonstration of the improvement in performance and scalability for one case study.
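
As a concrete illustration of wavelet-based reduction, the sketch below (assuming the PyWavelets package; the signal, wavelet choice, and retention ratio are arbitrary) zeroes the smallest coefficients of a discrete wavelet transform and reconstructs a compact approximation of the original series.

```python
import numpy as np
import pywt  # PyWavelets, assumed to be installed

def wavelet_reduce(signal, wavelet="haar", level=4, keep_ratio=0.05):
    """Reduce a 1-D signal by keeping only the largest wavelet coefficients.

    keep_ratio is the fraction of coefficients retained; the rest are set
    to zero before reconstruction, trading accuracy for storage.
    """
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    flat = np.concatenate([c.ravel() for c in coeffs])
    threshold = np.quantile(np.abs(flat), 1.0 - keep_ratio)
    reduced = [pywt.threshold(c, threshold, mode="hard") for c in coeffs]
    return pywt.waverec(reduced, wavelet)

# Example: a noisy sine wave approximated from roughly 5% of its coefficients
t = np.linspace(0, 1, 4096)
x = np.sin(2 * np.pi * 8 * t) + 0.1 * np.random.randn(t.size)
approx = wavelet_reduce(x)
```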


Author(s): Chris A. Mattmann, Andrew Hart, Luca Cinquini, Joseph Lazio, Shakeh Khudikyan, et al.

Big data as a paradigm focuses on data volume, velocity, and the number and complexity of data formats and metadata (the information that describes other data). This is nowhere better seen than in the development of the software to support next-generation astronomical instruments, including the MeerKAT/KAT-7 Square Kilometre Array (SKA) precursor in South Africa; the Low Frequency Array (LOFAR) in Europe; two instruments led in part by the U.S. National Radio Astronomy Observatory (NRAO), the Expanded Very Large Array (EVLA) in Socorro, NM, and the Atacama Large Millimeter Array (ALMA) in Chile; and other instruments such as the Large Synoptic Survey Telescope (LSST) to be built in northern Chile. This chapter highlights the big data challenges in constructing data management systems for these astronomical instruments, specifically the challenges of integrating legacy science codes, handling data movement and triage, building flexible science data portals and user interfaces, allowing for flexible technology deployment scenarios, and automatically and rapidly mitigating differences in science data formats and metadata models. The authors discuss these challenges and then suggest open source solutions to them based on software from the Apache Software Foundation, including Apache Object-Oriented Data Technology (OODT), Tika, and Solr. The authors have leveraged these solutions to effectively and expeditiously build many precursor and operational software systems that handle data from these astronomical instruments and to prepare for the coming data deluge from those not yet constructed. Their solutions are not specific to the astronomical domain and are already applicable to a number of science domains, including Earth science, planetary science, and biomedicine.
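
A minimal sketch of the kind of pipeline described here, assuming the tika-python bindings, the pysolr client, and a local Solr core named "observations" (the core name and file path are invented), extracts metadata from a data file with Apache Tika and indexes it in Apache Solr.

```python
import pysolr               # assumed: pysolr client for Apache Solr
from tika import parser     # assumed: tika-python bindings for Apache Tika

# Hypothetical local Solr core for instrument data products
solr = pysolr.Solr("http://localhost:8983/solr/observations", timeout=30)

def index_data_product(path):
    """Extract metadata and text from a science data file and index it in Solr."""
    parsed = parser.from_file(path)            # {'metadata': ..., 'content': ...}
    metadata = parsed.get("metadata") or {}
    doc = {
        "id": path,
        "content_type": metadata.get("Content-Type", "unknown"),
        "text": (parsed.get("content") or "")[:10000],  # truncate large payloads
    }
    solr.add([doc])

index_data_product("/data/products/example.fits")  # hypothetical file path
```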


Author(s): Mian Lu, Qiong Luo

Large-scale Genome-Wide Association Studies (GWAS) are a Big Data application due to the large amount of data to process and the high computational intensity. Furthermore, numerical issues (e.g., floating point underflow) limit the data scale in some applications. Graphics Processing Units (GPUs) have been used to accelerate genomic data analytics, such as sequence alignment, Single-Nucleotide Polymorphism (SNP) detection, and Minor Allele Frequency (MAF) computation. As MAF computation is the most time-consuming task in GWAS, the authors discuss in detail their techniques for accelerating this task on the GPU. They first present a reduction-based algorithm that better matches the GPU's data-parallel architecture than the original algorithm implemented in the CPU-based tool. They then implement this algorithm efficiently on the GPU by carefully optimizing local memory utilization and avoiding user-level synchronization. As the MAF computation suffers from floating point underflow, the authors transform the computation to logarithm space. In addition to the MAF computation, they briefly introduce GPU-accelerated sequence alignment and SNP detection. The experimental results show that the GPU-based GWAS implementations can accelerate state-of-the-art CPU-based tools by up to an order of magnitude.
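
The underflow issue mentioned above can be seen in a small sketch (illustrative only, not the authors' GPU kernel): multiplying many small per-site likelihoods underflows in double precision, whereas summing their logarithms stays numerically stable.

```python
import numpy as np

def combine_likelihoods(per_site_probs):
    """Combine many small probabilities directly and in log space.

    A direct product of 100,000 values around 1e-3 underflows to 0.0 in
    double precision; summing logarithms keeps the computation usable,
    mirroring the log-space transformation applied to the MAF statistic.
    """
    naive = np.prod(per_site_probs)           # underflows to 0.0
    stable = np.sum(np.log(per_site_probs))   # log-space equivalent
    return naive, stable

probs = np.full(100_000, 1e-3)                # hypothetical per-site likelihoods
naive, stable = combine_likelihoods(probs)
print(naive)    # 0.0 due to underflow
print(stable)   # about -690775.5, the usable log-likelihood
```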


Author(s): Stacy T. Kowalczyk, Yiming Sun, Zong Peng, Beth Plale, Aaron Todd, et al.

Big Data in the humanities is a new phenomenon that is expected to revolutionize the process of humanities research. The HathiTrust Research Center (HTRC) is a cyberinfrastructure to support humanities research on big humanities data. The HTRC has been designed to make the technology serve the researcher: to make content easy to find, to make the research tools efficient and effective, and to allow researchers to customize their environment, combine their own data with that of the HTRC, and contribute tools. The architecture has multiple layers of abstraction, providing a secure, scalable, extendable, and generalizable interface for both human and computational users.


Author(s): Lynne M. Webb, Yuanxin Wang

The chapter reviews traditional sampling techniques and suggests adaptations relevant to big data studies of text downloaded from online media such as email messages, online gaming, blogs, micro-blogs (e.g., Twitter), and social networking websites (e.g., Facebook). The authors review methods of probability, purposeful, and adaptive sampling of online data. They illustrate the use of these sampling techniques via published studies that report analysis of online text.
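
For illustration, the sketch below (hypothetical data, not drawn from the chapter) applies two of the probability sampling designs the authors review, simple random sampling and systematic sampling, to a downloaded corpus of posts.

```python
import random

posts = [f"post_{i}" for i in range(100_000)]   # hypothetical downloaded corpus

def simple_random_sample(items, n, seed=42):
    """Simple random sample: every item has an equal chance of selection."""
    return random.Random(seed).sample(items, n)

def systematic_sample(items, n, seed=42):
    """Systematic sample: a random start, then every k-th item thereafter."""
    k = len(items) // n
    start = random.Random(seed).randrange(k)
    return items[start::k][:n]

sample_a = simple_random_sample(posts, 1000)
sample_b = systematic_sample(posts, 1000)
```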


Author(s): Gueyoung Jung, Tridib Mukherjee

In the modern information era, the amount of data has exploded, and current trends indicate that it will continue to grow exponentially. This prevalent, humongous amount of data, referred to as big data, has given rise to the problem of finding the “needle in the haystack” (i.e., extracting meaningful information from big data). Many researchers and practitioners are focusing on big data analytics to address the problem. One of the major issues in this regard is the computational requirement of big data analytics. In recent years, the proliferation of loosely coupled distributed computing infrastructures (e.g., modern public, private, and hybrid clouds, high performance computing clusters, and grids) has made high computing capability available for large-scale computation. This has allowed the execution of big data analytics to gather pace across organizations and enterprises. However, even with this high computing capability, it remains a big challenge to efficiently extract valuable information from such vast data. Hence, unprecedented scalability of performance is required for the execution of big data analytics. A key question in this regard is how to maximally leverage the high computing capabilities of the aforementioned loosely coupled distributed infrastructures to ensure fast and accurate execution of big data analytics. To that end, this chapter focuses on synchronous parallelization of big data analytics over a distributed system environment to optimize performance.
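
As a rough sketch of the synchronous pattern (not the authors' system; the workload and partitioning are invented), the data are split across workers, each partition is analyzed in parallel, and a synchronization barrier merges the partial results before the next stage.

```python
from multiprocessing import Pool

def analyze_partition(partition):
    """Stand-in analytic task: partial sum and count for one data partition."""
    return sum(partition), len(partition)

def synchronous_parallel_mean(data, workers=8):
    """Split the data, process partitions in parallel, then synchronize.

    Pool.map blocks until every worker finishes, acting as the
    synchronization barrier before the partial results are merged.
    """
    size = max(1, len(data) // workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partials = pool.map(analyze_partition, partitions)   # barrier here
    total, count = (sum(values) for values in zip(*partials))
    return total / count

if __name__ == "__main__":
    print(synchronous_parallel_mean(list(range(1_000_000))))  # 499999.5
```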


Author(s): Kapil Bakshi

This chapter provides a review and analysis of several key Big Data technologies. Many Big Data technologies are currently in development and implementation, so a comprehensive review of all of them is beyond the scope of this chapter; the focus here is on the most widely adopted ones. The key Big Data technologies discussed are Map-Reduce, NoSQL, MPP (Massively Parallel Processing), and In-Memory Database technologies. For each of these, the following subtopics are covered: the history and genesis of the technology, the problem set it solves for Big Data analytics, and the details of the technology, including its components, technical architecture, and theory of operations. This is followed by technical operation and infrastructure (compute, storage, and network), design considerations, and performance benchmarks. Finally, the chapter provides an integrated approach to the above-mentioned Big Data technologies.
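
As a concrete reference point for the first of these technologies, the sketch below (a toy, single-process illustration rather than a distributed deployment) expresses word counting in the Map-Reduce style of map, shuffle, and reduce phases.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

documents = ["big data technologies", "big data analytics", "in memory databases"]
intermediate = chain.from_iterable(map_phase(d) for d in documents)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(intermediate).items())
# {'big': 2, 'data': 2, 'technologies': 1, 'analytics': 1, 'in': 1, 'memory': 1, 'databases': 1}
```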

