Application of Apache Big Data technologies for the problems of climate monitoring

Author(s):  
Сергей Юрьевич Золотов ◽  
Игорь Юрьевич Турчановский

This paper describes an experiment on the use of Apache Big Data technologies in climate system research. During the experiment, four variants of a test problem were implemented. Speeding up calculations with Apache Big Data technologies proved achievable, and the most effective approach was found in the fourth variant: converting the original datasets to a format suitable for storage in a distributed file system and applying Spark SQL from the Apache Big Data stack for parallel data processing on computing clusters. The core of the Apache Big Data stack consists of two technologies: Apache Hadoop for organizing distributed file storage of unlimited capacity and Apache Spark for organizing parallel computing on computing clusters. The combination of Apache Spark and Apache Hadoop is fully applicable to building big data processing systems. The main idea implemented by Spark is dividing data into separate parts (partitions) and processing these parts in the memory of many computers connected within a network. Data is sent only when needed, and Spark automatically detects when the exchange will take place. As a test problem, we chose the calculation of monthly, annual, and seasonal trends in the temperature of the Earth's atmosphere for the period from 1960 to 2010 according to the NCEP/NCAR and JRA-55 reanalysis data. The first variant is the simplest implementation, without parallelism. The second variant assumes parallel reading of data from the local file system, aggregation, and calculation of trends. The third variant runs the test problem on a two-node cluster: the NCEP/NCAR and JRA-55 reanalysis files were placed in their original format in Hadoop storage (HDFS), which combines the disk subsystems of the two computers. The disadvantage of this variant is that all reanalysis files are loaded completely into the random access memory of the worker process. The solution proposed in the fourth variant is to pre-convert the original file format to one that allows selective reading from HDFS based on the specified parameters.
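A minimal sketch of the fourth variant's idea, under stated assumptions: the reanalysis records have been pre-converted to Parquet files in HDFS with hypothetical `year`, `month`, and `temperature` columns (the path and column names are illustrative, not from the paper). Spark SQL can then push the year filter down to storage and read selectively, instead of loading whole files into memory, before computing a least-squares trend per month.

```python
# Sketch of the fourth variant, assuming the reanalysis data were
# pre-converted to Parquet in HDFS with hypothetical columns
# (year, month, temperature); path and column names are illustrative.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("climate-trends").getOrCreate()

# Selective read: with Parquet, Spark pushes the filter down to storage
# instead of loading every reanalysis file completely into memory.
df = (spark.read.parquet("hdfs:///reanalysis/ncep_parquet")
      .filter((F.col("year") >= 1960) & (F.col("year") <= 2010)))

# Mean temperature per (year, month), then an ordinary least-squares
# slope per month across years: slope = cov(year, T) / var(year).
monthly = df.groupBy("year", "month").agg(F.avg("temperature").alias("t"))
trends = monthly.groupBy("month").agg(
    (F.covar_pop("year", "t") / F.var_pop("year")).alias("trend_per_year"))
trends.orderBy("month").show()
```

Storing the converted data as Parquet means Spark reads only the columns and row groups a query needs, which matches the selective-read behavior the fourth variant relies on.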

Author(s):  
Alicia VALDEZ-MENCHACA ◽  
Laura VAZQUEZ-DE LOS SANTOS ◽  
Griselda CORTES-MORALES ◽  
Ana PAIZ-RIVERA

The objective of this research project is the design and implementation of a technological strategy for the use of big data technologies such as Apache Hadoop, as well as its supporting software projects, to prepare medium-sized companies for new innovative technologies. The methodology included an analysis of big data best practices and of the software required to design and configure a big data environment on a Linux server for the technological proposal. As a first result, a roadmap for the installation and configuration of the Hadoop software running on a Linux virtual machine has been obtained, along with the proposal of the technological strategy, whose main components are: analysis of the technological architecture, selection of the processes or data to be analyzed, and installation of Hadoop, among others.
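As a hedged illustration of the verification step such an installation roadmap might end with (not taken from the paper), the following Python sketch smoke-tests a fresh single-node Hadoop installation by driving the standard `hdfs dfs` command-line tools; the HDFS paths are illustrative.

```python
# Illustrative smoke test for a single-node Hadoop install (not from the
# paper): exercises HDFS through the standard `hdfs dfs` CLI commands.
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Create a test directory, upload a local file, and list it back.
hdfs("-mkdir", "-p", "/tmp/smoke_test")
hdfs("-put", "-f", "/etc/hostname", "/tmp/smoke_test/")
print(hdfs("-ls", "/tmp/smoke_test"))

# Clean up after the check.
hdfs("-rm", "-r", "-f", "/tmp/smoke_test")
```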


2021 ◽  
Vol 5 (1) ◽  
pp. 12
Author(s):  
Otmane Azeroual ◽  
Renaud Fabre

Big data have become a global strategic issue, as increasingly large amounts of unstructured data challenge the IT infrastructure of global organizations and threaten their capacity for strategic forecasting. As with earlier large-scale information challenges, big data technologies such as Hadoop should efficiently tackle the incoming volumes of data and provide organizations with relevant processed information that was formerly neither visible nor manageable. After briefly recalling the strategic advantages of big data solutions in the introductory remarks, in the first part of this paper we focus on the advantages of big data solutions in the currently difficult time of the COVID-19 pandemic, which we characterize as an endemic heterogeneous data context; we then outline the advantages of technologies such as Hadoop and their IT suitability in this context. In the second part, we identify two specific advantages of Hadoop solutions, globality combined with flexibility, and we observe that they are at work in a “Hadoop Fusion Approach” that we describe as an optimal response to the context. In the third part, we justify the selected qualifications of globality and flexibility by the fact that Hadoop solutions enable comparable returns in the opposite contexts of models of partial submodels and models of final exact systems. In the fourth part, we remark that in both of these opposite contexts, Hadoop solutions allow a large range of needs to be fulfilled, which fits the requirements previously identified in the heterogeneous data structure of COVID-19 information. In the final part, we propose a framework of strategic data processing conditions that, to the best of our knowledge, appear to be the most suitable for overcoming the massive information challenges of COVID-19.


2021 ◽  
Vol 348 ◽  
pp. 01003
Author(s):  
Abdullayev Vugar Hacimahmud ◽  
Ragimova Nazila Ali ◽  
Khalilov Matlab Etibar

The volume of information in the 21st century is growing at a rapid pace, and big data technologies are used to process this modern flow of information. This article discusses the use of big data technologies to implement the monitoring of social processes. Big data has its own characteristics and principles, which are reflected here, and we also discuss big data applications in several areas. Particular attention is paid to the interactions between big data and sociology; to this end, we consider digital sociology and the computational social sciences. One of the main objects of study in sociology is social processes, and the article describes the types of social processes and how they can be monitored. As an example, monitoring of social processes at a university is implemented. The following technologies are used for the realization of social process monitoring: 1010data products (1010edge, 1010connect, 1010reveal, 1010equities), Apache Software Foundation products (Apache Hive, Apache Chukwa, Apache Hadoop, Apache Pig), the MapReduce framework, the R language, the Pandas library, NoSQL, etc. In particular, this article examines the use of the MapReduce model for monitoring social processes at the university.
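As a hedged sketch of how the MapReduce model could be applied to such monitoring (the event-log format and field names are assumptions, not taken from the article), here is a Hadoop-streaming-style mapper and reducer in Python that count university activity events per day:

```python
# Hadoop-streaming-style MapReduce sketch (assumed log format, not from
# the article): counts university activity events per date.
# In a real deployment, mapper and reducer would be two standalone
# scripts, each reading key/value lines from stdin.
import sys

def mapper(stream=sys.stdin):
    """Read 'date<TAB>student_id<TAB>event' lines; emit (date, 1)."""
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            date, _student_id, _event = fields
            print(f"{date}\t1")

def reducer(stream=sys.stdin):
    """Receive mapper output sorted by key; sum the counts per date."""
    current_date, count = None, 0
    for line in stream:
        date, value = line.rstrip("\n").split("\t")
        if date != current_date:
            if current_date is not None:
                print(f"{current_date}\t{count}")
            current_date, count = date, 0
        count += int(value)
    if current_date is not None:
        print(f"{current_date}\t{count}")
```

With Hadoop streaming, the two functions would run as separate `mapper.py` and `reducer.py` scripts passed via the streaming jar's `-mapper` and `-reducer` options, with Hadoop performing the sort-and-shuffle between them.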


2019 ◽  
Vol 8 (1) ◽  
pp. 14 ◽  
Author(s):  
Elham Nazari ◽  
Mohammad Hasan Shahriari ◽  
Hamed Tabesh

Introduction: Health care data is growing rapidly, and the correct analysis of such data can improve the quality of care and reduce costs. This kind of data has certain features, such as high volume, variety, and high-speed production, that make it impossible to analyze on ordinary hardware and software platforms, so choosing the right platform for managing it is very important. The purpose of this study is to introduce and compare the most popular and most widely used platform for processing big data, Apache Hadoop MapReduce, with the Apache Spark and Apache Flink platforms, which have recently gained great prominence.
Material and Methods: This study is a survey based on a subject search of the ProQuest, PubMed, Google Scholar, Science Direct, Scopus, IranMedex, Irandoc, Magiran, ParsMedline, and Scientific Information Database (SID) databases, as well as web reviews and specialized books, using related and standard keywords. In total, 80 articles related to the subject of the study were reviewed.
Results: The findings showed that the studied platforms differ in features such as data processing model, support for different languages, processing speed, computational model, memory management, optimization, latency, fault tolerance, scalability, performance, compatibility, security, and so on. Overall, the Apache Hadoop environment offers simplicity, error detection, and cluster-based scalability management, but because its processing is batch-based it is slow for complex analyses and does not support stream processing. Apache Spark is a distributed computational platform that can process a big data set in memory with a very fast response time. Apache Flink allows users to store data in memory and load it multiple times, and provides a sophisticated fault-tolerance mechanism that continuously recovers the state of the data flow.
Conclusion: The choice among big data analysis and processing platforms varies according to need. In other words, these technologies are complementary: each is applicable in a particular field, they cannot be separated from one another, and, depending on the purpose and the expected result, a platform must be selected for the analysis or custom tools designed on top of these platforms.
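To make the in-memory point concrete, here is a hedged PySpark sketch (the file path and column names are illustrative, not from the study) showing how caching a dataset in cluster memory lets several analyses reuse it without re-reading from disk, the behavior the Results section attributes to Spark in contrast to batch-oriented MapReduce:

```python
# Hedged PySpark sketch of in-memory reuse (illustrative path/columns):
# one disk read feeds several analyses, whereas a batch MapReduce job
# would re-scan the input for each separate computation.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("inmemory-demo").getOrCreate()

records = spark.read.csv("hdfs:///health/records.csv",
                         header=True, inferSchema=True)
records.cache()  # keep the dataset in cluster memory after first use

# Both queries below reuse the cached partitions instead of re-reading.
by_dept = records.groupBy("department").count()
avg_stay = records.agg(F.avg("length_of_stay").alias("avg_stay"))
by_dept.show()
avg_stay.show()
```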


Author(s):  
Ebru Aydindag Bayrak ◽  
Pinar Kirci

This article presents a brief introduction to big data and big data analytics, and to their roles in the healthcare system. A range of scientific research on big data analytics in the healthcare system has been reviewed. The definition of big data, the components of big data, medical big data sources, big data technologies currently in use, and big data analytics in healthcare are examined under separate headings. The historical development of big data analytics is also covered. As a well-known big data analytics technology, Apache Hadoop and its core components and tools are explained briefly. Moreover, a glance at some big data analytics tools and platforms apart from the Hadoop ecosystem is given. The main goal is to help researchers and specialists by giving an overview of the rising importance of big data analytics in healthcare systems.


2019 ◽  
Vol 6 (5) ◽  
pp. 519
Author(s):  
Aminudin Aminudin ◽  
Eko Budi Cahyono

Apache Spark is a platform that can be used to process data of relatively large size (big data), with the ability to divide the data across a set of predetermined cluster nodes, a concept called parallel computing. Apache Spark has advantages over similar frameworks such as Apache Hadoop: it can process data in a streaming fashion, meaning that data entering the Apache Spark environment can be processed immediately without waiting for other data to be collected. To enable machine learning processes in Apache Spark, the experiment in this paper integrates Apache Spark, acting as the large-scale data processing environment and providing the parallel computing model, with the H2O library, which specifically handles data processing with machine learning algorithms. Based on the results of testing Apache Spark in a cloud computing environment, Apache Spark was able to process weather data obtained from the largest weather data archive, the NCDC dataset, with data sizes up to 6 GB. The data were processed using a deep learning model, one of the machine learning models, by dividing the work across several nodes formed in the cloud computing environment using the H2O library. This success can be seen in the tested parameters, including running time, throughput, average memory, and average CPU, obtained from the HiBench benchmark. All of these values are influenced by the amount of data and the number of nodes.
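A hedged sketch of the Spark-plus-H2O integration the paper describes, using the Sparkling Water bridge: the NCDC-style column names and target variable are assumptions, and the exact `H2OContext` initialization varies between Sparkling Water versions.

```python
# Hedged Sparkling Water sketch (assumed NCDC-style columns; the exact
# H2OContext API differs slightly across Sparkling Water versions).
from pyspark.sql import SparkSession
from pysparkling import H2OContext
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

spark = SparkSession.builder.appName("spark-h2o-weather").getOrCreate()
hc = H2OContext.getOrCreate()  # attaches an H2O cluster to the Spark nodes

# Illustrative weather table: predictor columns and a target temperature.
weather = spark.read.csv("hdfs:///ncdc/weather.csv",
                         header=True, inferSchema=True)
frame = hc.asH2OFrame(weather)  # hand the Spark DataFrame over to H2O

# Distributed deep learning across the cluster nodes via H2O.
model = H2ODeepLearningEstimator(hidden=[64, 64], epochs=10)
model.train(x=["station", "month", "pressure", "humidity"],
            y="temperature", training_frame=frame)
print(model.rmse())
```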


Symmetry ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 195 ◽  
Author(s):  
Vladimir Belov ◽  
Andrey Tatarintsev ◽  
Evgeny Nikulchev

One of the most important tasks of any big data processing platform is storing the received data. Different systems have different requirements for big data storage formats, which raises the problem of choosing the optimal data storage format for the problem at hand. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats, and proposes a methodology for choosing among them. The following data storage formats are considered: Avro, CSV, JSON, ORC, and Parquet. At the first stage, a comparative analysis of the main characteristics of the studied formats was carried out; at the second stage, an experimental evaluation of these formats was prepared and performed. For the experiment, a test stand with big data processing tools installed on it was deployed. The aim of the experiment was to measure characteristics of the data storage formats, such as the data volume and the processing speed of different operations, using the Apache Spark framework. In addition, within the study, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is presented in the form of a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.
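A hedged sketch of the kind of measurement such an evaluation involves (not the authors' actual benchmark; the output paths are illustrative, and Avro support assumes the spark-avro package is available): write one DataFrame in each of the five formats with Spark and time the write-and-read round trip.

```python
# Hedged sketch of a storage-format comparison with Spark (not the
# authors' benchmark). Avro assumes the spark-avro package is on the
# classpath, e.g. --packages org.apache.spark:spark-avro_2.12:<version>.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-bench").getOrCreate()
df = spark.range(10_000_000).withColumnRenamed("id", "value")

for fmt in ["avro", "csv", "json", "orc", "parquet"]:
    path = f"hdfs:///tmp/bench/{fmt}"  # illustrative output location
    start = time.perf_counter()
    df.write.mode("overwrite").format(fmt).save(path)
    spark.read.format(fmt).load(path).count()  # force a full read back
    print(f"{fmt}: {time.perf_counter() - start:.1f}s round trip")
```

Comparing the elapsed times and on-disk sizes across formats gives exactly the volume and processing-speed characteristics the paper's rating vector is built from.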


Author(s):  
Muhammad Junaid ◽  
Shiraz Ali Wagan ◽  
Nawab Muhammad Faseeh Qureshi ◽  
Choon Sung Nam ◽  
Dong Ryeol Shin
