
2021 ◽

pp. 98-108

Author(s):

Сергей Юрьевич Золотов ◽

Игорь Юрьевич Турчановский

Keyword(s):

Big Data ◽

Test Problem ◽

Random Access ◽

Main Idea ◽

Reanalysis Data ◽

Apache Spark ◽

Apache Hadoop ◽

Big Data Technologies ◽

Original File ◽

Disk Subsystems

This paper describes an experiment on using Apache Big Data technologies in climate system research. Four variants of solving a test problem were implemented in the experiment. Speeding up the calculations with Apache Big Data technologies is achievable, and the most efficient way to do so was found in the fourth variant. The solution converts the original datasets into a format suitable for storage in a distributed file system and applies Spark SQL, from the Apache Big Data stack, to process the data in parallel on computing clusters.

The core of the Apache Big Data stack consists of two technologies: Apache Hadoop, for organizing distributed file storage of practically unlimited capacity, and Apache Spark, for organizing parallel computation on computing clusters. The combination of Apache Spark and Apache Hadoop is fully applicable to building big data processing systems. The main idea implemented by Spark is to divide data into separate parts (partitions) and to process these parts in the memory of many networked computers. Data is sent between nodes only when needed, and Spark automatically determines when that exchange will take place.

As the test problem, we chose the calculation of monthly, annual, and seasonal trends in the temperature of the Earth's atmosphere for the period from 1960 to 2010, according to the NCEP/NCAR and JRA-55 reanalysis data. During the experiment, four variants of solving the test problem were implemented. The first variant is the simplest implementation, without parallelism. The second variant reads data from the local file system in parallel, then aggregates it and calculates the trends. The third variant solves the test problem on a two-node cluster: the NCEP/NCAR and JRA-55 reanalysis files were placed in their original format in the Hadoop storage (HDFS), which combines the disk subsystems of the two computers. The disadvantage of this variant is that every reanalysis file is loaded in full into the random access memory of the worker process. The solution proposed in the fourth variant is to pre-convert the original files into a form that allows selective reading from HDFS, based on the specified parameters.
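The partition-and-process idea behind the trend calculation can be sketched in plain Python. This is only an illustration, not the authors' implementation: it assumes a trend is the least-squares slope of temperature against year, it uses a thread pool as a stand-in for Spark executors, and the data is synthetic. All names (`linear_trend`, `partition`, `parallel_trends`, `SERIES`, the `grid_*` keys) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def linear_trend(years, temps):
    """Least-squares slope of temperature against year (units per year)."""
    y_bar, t_bar = mean(years), mean(temps)
    num = sum((y - y_bar) * (t - t_bar) for y, t in zip(years, temps))
    den = sum((y - y_bar) ** 2 for y in years)
    return num / den


def partition(items, n_parts):
    """Split a list into n_parts chunks (some may be empty)."""
    return [items[i::n_parts] for i in range(n_parts)]


def trends_for_partition(part):
    """Compute the trend for every (key, series) pair in one partition."""
    return {key: linear_trend(years, temps) for key, (years, temps) in part}


def parallel_trends(series, n_parts=4):
    """Process each partition independently, then merge the partial results."""
    parts = partition(list(series.items()), n_parts)
    result = {}
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        for partial in pool.map(trends_for_partition, parts):
            result.update(partial)
    return result


# Synthetic series keyed by (grid point, month); NOT real reanalysis data.
YEARS = list(range(1960, 2011))
SERIES = {
    ("grid_0", 1): (YEARS, [250.0 + 0.02 * (y - 1960) for y in YEARS]),
    ("grid_1", 1): (YEARS, [260.0 - 0.01 * (y - 1960) for y in YEARS]),
    ("grid_2", 7): (YEARS, [270.0 for _ in YEARS]),
}

TRENDS = parallel_trends(SERIES, n_parts=2)
```

Each partition is processed without reference to the others, and results are combined only at the end. In this sketch every series is already in memory, which corresponds to the weakness noted for the third variant; the fourth variant avoids it by reading from storage only the data selected by the given parameters.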
