Executing certified model transformations on Apache Spark

The paper addresses distributed transaction processing in the analysis of large volumes of data for the purpose of finding association rules. Based on AIS and Apriori, two well-known data mining algorithms for finding frequent itemsets, possible parallelization options were identified that avoid iterative scanning of the database and high memory consumption. The possibility of porting the computations to various platforms that support parallel data processing was investigated. The computing platforms chosen were MapReduce, a powerful framework for processing large, distributed datasets on a Hadoop cluster, and Apache Spark, a software tool for processing extremely large amounts of data. A comparative analysis of the performance of the considered methods was carried out, recommendations for the effective use of parallel computing platforms were obtained, and modifications of the association rule mining algorithms were proposed. The main tasks accomplished in the work are: studying modern tools for distributed processing of structured and unstructured data; deploying a test cluster in a cloud service; developing scripts to automate cluster deployment; modifying distributed algorithms to adapt them to the required distributed computing frameworks; measuring data processing performance in sequential and distributed modes using Hadoop MapReduce and Apache Spark; carrying out a comparative analysis of the performance test results; obtaining and justifying the relationship between the amount of data processed and the processing time; optimizing distributed association rule mining algorithms for large volumes of transactional data; and measuring the performance of distributed processing with existing software tools. Keywords: distributed processing, transactional data, association rules, computing cluster, Hadoop, MapReduce, Apache Spark

Increase the Performance of K-Means Clustering Algorithm Using Apache Spark

The International Journal of Internet of Things and its Applications ◽

10.21742/ijiota.2017.1.1.02 ◽

2017 ◽

Vol 1 (1) ◽

pp. 13-28 ◽

Cited By ~ 1

Author(s):

Chang Xie ◽

Keyword(s):

Clustering Algorithm ◽

Apache Spark

Exploring Apache Spark Data APIs for Water Big Data Management

Advances in Intelligent Systems and Computing - Advanced Intelligent Systems for Sustainable Development (AI2SD’2018) ◽

10.1007/978-3-030-11881-5_10 ◽

2019 ◽

pp. 105-117

Author(s):

Nassif El Hassane ◽

Hicham Hajji

Keyword(s):

Big Data ◽

Data Management ◽

Apache Spark

Big data Predictive Analytics for Apache Spark using Machine Learning

2020 Global Conference on Wireless and Optical Technologies (GCWOT) ◽

10.1109/gcwot49901.2020.9391620 ◽

2020 ◽

Author(s):

Muhammad Junaid ◽

Shiraz Ali Wagan ◽

Nawab Muhammad Faseeh Qureshi ◽

Choon Sung Nam ◽

Dong Ryeol Shin

Keyword(s):

Machine Learning ◽

Big Data ◽

Predictive Analytics ◽

Apache Spark

Scalable Taxonomy Generation and Evolution on Apache Spark

2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) ◽

10.1109/dasc-picom-cbdcom-cyberscitech49142.2020.00110 ◽

2020 ◽

Author(s):

Kanwal Aalijah ◽

Rabia Irfan

Keyword(s):

Apache Spark

Performance Prediction for Data-driven Workflows on Apache Spark

2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) ◽

10.1109/mascots50786.2020.9285944 ◽

2020 ◽

Author(s):

Andrea Gulino ◽

Arif Canakoglu ◽

Stefano Ceri ◽

Danilo Ardagna

Keyword(s):

Performance Prediction ◽

Apache Spark ◽

Data Driven

A scalable random walk with restart on heterogeneous networks with Apache Spark for ranking disease-related genes through type-II fuzzy data fusion

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2021.103688 ◽

2021 ◽

Vol 115 ◽

pp. 103688

Author(s):

Mehdi Joodaki ◽

Nasser Ghadiri ◽

Zeinab Maleki ◽

Maryam Lotfi Shahreza

Keyword(s):

Random Walk ◽

Data Fusion ◽

Heterogeneous Networks ◽

Apache Spark ◽

Fuzzy Data ◽

Type II ◽

Random Walk With Restart ◽

Disease Related Genes

Nodule Detection with Convolutional Neural Network Using Apache Spark and GPU Frameworks

Applied Sciences ◽

10.3390/app11062838 ◽

2021 ◽

Vol 11 (6) ◽

pp. 2838

Author(s):

Nikitha Johnsirani Venkatesan ◽

Dong Ryeol Shin ◽

Choon Sung Nam

Keyword(s):

Neural Network ◽

Radiation Dose ◽

Convolutional Neural Network ◽

Model Performance ◽

Performance Comparison ◽

Apache Spark ◽

Training Time ◽

Learning Framework ◽

Proposed Model

In the pharmaceutical field, early detection of lung nodules is indispensable for increasing patient survival. The quality of medical images can be enhanced by intensifying the radiation dose, but a high radiation dose provokes cancer, which forces experts to use limited radiation. Using abrupt radiation generates noise in CT scans. We propose an optimal Convolutional Neural Network model in which Gaussian noise is removed for better classification and increased training accuracy. Experiments on the 160 GB LUNA16 dataset show that our proposed method exhibits superior results. Classification accuracy, specificity, sensitivity, precision, recall, F1 measure, and area under the ROC curve (AUC) are used as evaluation metrics of model performance. We compared the performance of our proposed model on several platforms, including Apache Spark, GPU, and CPU, to reduce the training time without compromising accuracy. Our results show that Apache Spark, integrated with a deep learning framework, is suitable for parallel training computation with high accuracy.

Filtering and Storing User Preferred Data: an Apache Spark Based Approach

2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) ◽

10.1109/dasc-picom-cbdcom-cyberscitech49142.2020.00115 ◽

2020 ◽

Author(s):

Bannya Chanda ◽

Shikharesh Majumdar

Keyword(s):

Apache Spark
