The Data Allocation Strategy Based on Load in NoSQL Database

With the development of Internet technology and cloud computing, more and more applications face the challenges of big data. NoSQL databases are well suited to managing big data because of their high scalability, high availability, and high fault tolerance, and they are one of the key technologies for big data management. The performance of massive data processing in a NoSQL database can be improved through large-scale parallel data processing and data-local computation, so how to allocate the data is a major challenge. In this paper we propose a data allocation strategy based on node load, which adjusts the allocation according to the execution status of the system and keeps the data allocation balanced at a small cost. Finally, we use experiments to verify the effectiveness of the proposed strategy. The experiments show that it improves system performance compared with other allocation strategies.
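
The general idea of load-based allocation can be illustrated with a minimal sketch. This is not the strategy proposed in the paper, whose details are not given in the abstract. It assumes a greedy "least-loaded node" placement plus a single rebalancing step, and the names `Node`, `allocate`, `rebalance_once`, and the per-partition load estimate are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Hypothetical bookkeeping: partition id -> estimated load of that partition.
    partitions: dict = field(default_factory=dict)

    @property
    def load(self):
        # Node load is assumed to be the sum of its partitions' estimated loads.
        return sum(self.partitions.values())


def allocate(nodes, partition_id, cost):
    """Place a new partition on the currently least-loaded node."""
    target = min(nodes, key=lambda n: n.load)
    target.partitions[partition_id] = cost
    return target


def rebalance_once(nodes, threshold):
    """Move one partition from the busiest to the idlest node.

    The move happens only if the load gap exceeds `threshold` and the
    move actually narrows the gap. Returns True if a partition moved.
    """
    busiest = max(nodes, key=lambda n: n.load)
    idlest = min(nodes, key=lambda n: n.load)
    gap = busiest.load - idlest.load
    if gap <= threshold or not busiest.partitions:
        return False
    # Moving the smallest partition keeps the migration cost low.
    pid, cost = min(busiest.partitions.items(), key=lambda kv: kv[1])
    if cost >= gap:
        return False  # moving it would not reduce the imbalance
    del busiest.partitions[pid]
    idlest.partitions[pid] = cost
    return True
```

Calling `rebalance_once` periodically with fresh load estimates is one way to adjust placement to the running state of the system while moving little data per step.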

Big Data Processing

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Applications and Approaches to Object-Oriented Software Design ◽

10.4018/978-1-7998-2142-7.ch005 ◽

2020 ◽

pp. 111-132

Author(s):

Can Eyupoglu

Keyword(s):

Big Data ◽

Data Processing ◽

Large Scale ◽

Scale Up ◽

It Industry ◽

Governmental Agencies ◽

Digital World ◽

Large Scale Data ◽

International Data ◽

Processing Techniques

Big data has attracted significant and increasing attention recently and has become a hot topic in the areas of IT industry, finance, business, academia, and scientific research. In the digital world, the amount of generated data has increased. According to the research of International Data Corporation (IDC), 33 zettabytes of data were created in 2018, and it is estimated that the amount of data will scale up more than five times from 2018 to 2025. In addition, the advertising sector, healthcare industry, biomedical companies, private firms, and governmental agencies have to make many investments in the collection, aggregation, and sharing of enormous amounts of data. To process this large-scale data, specific data processing techniques are used rather than conventional methodologies. This chapter deals with the concepts, architectures, technologies, and techniques that process big data.

MAPREDUCE: INSIGHT ANALYSIS OF BIG DATA VIA PARALLEL DATA PROCESSING USING JAVA PROGRAMMING, HIVE AND APACHE PIG

International Journal of Advanced Research in Computer Science ◽

10.26483/ijarcs.v9i1.5414 ◽

2018 ◽

Vol 9 (1) ◽

pp. 536-540 ◽

Cited By ~ 1

Author(s):

Dr. Ujjwal Agarwal ◽

Keyword(s):

Big Data ◽

Data Processing ◽

Java Programming ◽

Parallel Data ◽

Apache Pig

Multi-GPU approach to global induction of classification trees for large-scale data mining

Applied Intelligence ◽

10.1007/s10489-020-01952-5 ◽

2021 ◽

Author(s):

Krzysztof Jurczuk ◽

Marcin Czajkowski ◽

Marek Kretowski

Keyword(s):

Data Mining ◽

Large Scale ◽

Real Life ◽

Population Based ◽

Tree Structure ◽

Global Approach ◽

Data Parallel ◽

Large Scale Data ◽

The Impact ◽

Scale Data

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously and thus gives improvements in the prediction and size of the resulting classifiers in many situations. However, it is a population-based and iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
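
The data-parallel decomposition of the fitness calculation can be sketched roughly as follows. This is an illustration only, not the authors' CUDA implementation: worker threads stand in for GPUs, a candidate tree is assumed to be any callable mapping a feature tuple to a label, fitness is assumed to be plain accuracy, and the names `chunk_correct` and `fitness` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_correct(tree, rows):
    """Count the rows of one data chunk that the tree classifies correctly."""
    return sum(1 for features, label in rows if tree(features) == label)


def fitness(tree, data, workers=4):
    """Accuracy of `tree` on `data`, with the data split across workers.

    Each worker (a stand-in for one GPU) scores its own chunk; the
    partial counts are then reduced on the caller's side.
    """
    if not data:
        return 0.0
    size = -(-len(data) // workers)  # ceiling division: rows per chunk
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        correct = sum(pool.map(lambda chunk: chunk_correct(tree, chunk), chunks))
    return correct / len(data)
```

Because each chunk is scored independently and only a single count per chunk is returned, the reduction step stays cheap regardless of how many rows each worker handles.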

Teaching large scale data processing

Proceedings of the 1st ACM Summit on Computing Education in China on First ACM Summit on Computing Education in China - SCE '08 ◽

10.1145/1517632.1517635 ◽

2008 ◽

Author(s):

Kang Chen ◽

Yubing Yin ◽

Weimin Zheng

Keyword(s):

Data Processing ◽

Large Scale ◽

Large Scale Data ◽

Large Scale Data Processing ◽

Scale Data

Advanced monitoring techniques for a large‐scale data‐processing network

Campus-Wide Information Systems ◽

10.1108/10650740810921448 ◽

2008 ◽

Vol 25 (5) ◽

pp. 287-300 ◽

Cited By ~ 1

Author(s):

B. Martin ◽

A. Al‐Shabibi ◽

S.M. Batraneanu ◽

Ciobotaru ◽

G.L. Darlea ◽

...

Keyword(s):

Data Processing ◽

Large Scale ◽

Monitoring Techniques ◽

Large Scale Data ◽

Large Scale Data Processing ◽

Processing Network ◽

Scale Data

Large scale data processing in real world: From analytics to predictions

2014 14th International Conference on Advances in ICT for Emerging Regions (ICTer) ◽

10.1109/icter.2014.7083870 ◽

2014 ◽

Author(s):

Srinath Perera

Keyword(s):

Data Processing ◽

Real World ◽

Large Scale ◽

Large Scale Data ◽

Large Scale Data Processing ◽

Scale Data

BestPeer++: A Peer-to-Peer Based Large-Scale Data Processing Platform

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2012.236 ◽

2014 ◽

Vol 26 (6) ◽

pp. 1316-1331 ◽

Cited By ~ 6

Author(s):

Gang Chen ◽

Tianlei Hu ◽

Dawei Jiang ◽

Peng Lu ◽

Kian-Lee Tan ◽

...

Keyword(s):

Data Processing ◽

Large Scale ◽

Peer To Peer ◽

Large Scale Data ◽

Large Scale Data Processing ◽

Processing Platform ◽

Scale Data

Parallel Data Mining and Applications in Hospital Big Data Processing

Big Data Management and Processing ◽

10.1201/9781315154008-20 ◽

2017 ◽

pp. 403-424

Author(s):

Jianguo Chen ◽

Zhuo Tang ◽

Kenli Li ◽

Keqin Li

Keyword(s):

Data Mining ◽

Big Data ◽

Data Processing ◽

Big Data Processing ◽

Parallel Data ◽

Parallel Data Mining

Support Vector Machines in Big Data Classification: A Systematic Literature Review

10.21203/rs.3.rs-663359/v1 ◽

2021 ◽

Author(s):

Mohammad Hassan Almaspoor ◽

Ali Safaei ◽

Afshin Salajegheh ◽

Behrouz Minaei-Bidgoli

Keyword(s):

Machine Learning ◽

Big Data ◽

Large Scale ◽

Support Vector ◽

Research Areas ◽

Large Scale Data ◽

Training Samples ◽

Big Data Classification ◽

Scale Data

Classification is one of the most important and widely used tasks in machine learning, the purpose of which is to create a rule, based on a set of training data, for assigning data to pre-existing categories. Employed successfully in many scientific and engineering areas, the Support Vector Machine (SVM) is among the most promising methods of classification in machine learning. With the advent of big data, many machine learning methods have been challenged by big data characteristics. The standard SVM was proposed for batch learning, in which all data are available at the same time. The SVM has a high time complexity, i.e., increasing the number of training samples intensifies the need for computational resources and memory. Hence, many attempts have been made to adapt the SVM to online learning conditions and to large-scale data. This paper focuses on the analysis, identification, and classification of existing methods for adapting the SVM to online conditions and large-scale data. These methods might be employed to classify big data and to propose research areas for future studies. Considering its advantages, the SVM can be among the first options for handling and classifying big data. For this purpose, appropriate techniques should be developed for data preprocessing in order to convert data into an appropriate form for learning. The existing frameworks for parallel and distributed processing should also be employed so that SVMs can be made scalable and suitably online to handle big data.

Data Lake Ecosystem Workflow

10.21079/11681/40203 ◽

2021 ◽

Author(s):

R. Salter ◽

Quyen Dong ◽

Cody Coleman ◽

Maria Seale ◽

Alicia Ruvinsky ◽

...

Keyword(s):

Big Data ◽

Language Processing ◽

Data Analytics ◽

Large Scale ◽

Big Data Analytics ◽

Lake Ecosystem ◽

Data Governance ◽

Government Organizations ◽

Large Scale Data ◽

Scale Data

The Engineer Research and Development Center, Information Technology Laboratory’s (ERDC-ITL’s) Big Data Analytics team specializes in the analysis of large-scale datasets with capabilities across four research areas that require vast amounts of data to inform and drive analysis: large-scale data governance, deep learning and machine learning, natural language processing, and automated data labeling. Unfortunately, data transfer between government organizations is a complex and time-consuming process requiring coordination of multiple parties across multiple offices and organizations. Past successes in large-scale data analytics have placed a significant demand on ERDC-ITL researchers, highlighting that few individuals fully understand how to successfully transfer data between government organizations; future project success therefore depends on a small group of individuals to efficiently execute a complicated process. The Big Data Analytics team set out to develop a standardized workflow for the transfer of large-scale datasets to ERDC-ITL, in part to educate peers and future collaborators on the process required to transfer datasets between government organizations. Researchers also aim to increase workflow efficiency while protecting data integrity. This report provides an overview of the created Data Lake Ecosystem Workflow by focusing on the six phases required to efficiently transfer large datasets to supercomputing resources located at ERDC-ITL.
