apache spark Latest Research Papers

Applying Apache Spark on Streaming Big Data for Health Status Prediction

Computers Materials & Continua ◽

10.32604/cmc.2022.019458 ◽

2022 ◽

Vol 70 (2) ◽

pp. 3511-3527

Author(s):

Ahmed Ismail Ebada ◽

Ibrahim Elhenawy ◽

Chang-Won Jeong ◽

Yunyoung Nam ◽

Hazem Elbakry ◽

...

Keyword(s):

Big Data ◽

Health Status ◽

Apache Spark ◽

Streaming Big Data

Halvade somatic: Somatic variant calling with Apache Spark

GigaScience ◽

10.1093/gigascience/giab094 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Dries Decap ◽

Louise de Schaetzen van Brienen ◽

Maarten Larmuseau ◽

Pascal Costanza ◽

Charlotte Herzeel ◽

...

Keyword(s):

Best Practices ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Computing Time ◽

Variant Calling ◽

Apache Spark ◽

Normal Sample ◽

Whole Genome ◽

Sequencing Data ◽

Somatic Variant

Abstract Background The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. Findings We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. Conclusions To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.

Real-Time Big Data Analysis Using Web Scraping in Apache Spark Environment: Case Study—Mobile Data Analysis from Flipkart

Artificial Intelligence and Technologies - Lecture Notes in Electrical Engineering ◽

10.1007/978-981-16-6448-9_20 ◽

2021 ◽

pp. 177-185

Author(s):

Pushpita Ganguly ◽

Giriraj Parihar ◽

M. Sivagami

Keyword(s):

Big Data ◽

Data Analysis ◽

Real Time ◽

Big Data Analysis ◽

Apache Spark ◽

Mobile Data ◽

Web Scraping

A Secured Healthcare Management and Service Retrieval for Society Over Apache Spark Hadoop Environment

IETE Journal of Research ◽

10.1080/03772063.2021.1963334 ◽

2021 ◽

pp. 1-20

Author(s):

J. Vimala Ithayan ◽

C. Sundar

Keyword(s):

Healthcare Management ◽

Apache Spark

Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

10.1109/bigdata52589.2021.9671595 ◽

2021 ◽

Author(s):

Sebastiaan Alvarez Rodriguez ◽

Jayjeet Chackrabroty ◽

Aaron Chu ◽

Ivo Jimenez ◽

Jeff LeFevre ◽

...

Keyword(s):

Apache Spark

MBTI personality classification using Apache Spark

10.1109/icecco53203.2021.9663858 ◽

2021 ◽

Author(s):

Kamila Orynbekova ◽

Assem Talasbek ◽

Abylay Omar ◽

Andrey Bogdanchikov ◽

Shirali Kadyrov

Keyword(s):

Apache Spark ◽

Personality Classification

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10110763 ◽

2021 ◽

Vol 10 (11) ◽

pp. 763

Author(s):

Panagiotis Moutafis ◽

George Mavrommatis ◽

Michael Vassilakopoulos ◽

Antonio Corral

Keyword(s):

Query Processing ◽

Nearest Neighbor ◽

Apache Spark ◽

Spatial Query ◽

K Nearest Neighbor ◽

Distributed Computing Systems ◽

Spatial Query Processing ◽

Apache Hadoop ◽

The One ◽

Query Algorithm

Aiming at the problem of spatial query processing in distributed computing systems, the design and implementation of new distributed spatial query algorithms is a current challenge. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data, without worrying about the data distribution mechanism and fault-tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves (K) points of the Training with the smallest sum of distances to every point of the Query. This spatial query has been actively studied in centralized environments and several performance improving techniques and pruning heuristics have been also proposed, while, a distributed algorithm in Apache Hadoop was recently proposed by our team. Since, in general, Apache Hadoop exhibits lower performance than Spark, in this paper, we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark. Moreover, techniques that improve performance and are applicable in Apache Spark are also incorporated. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and a clear winner in comparison to processing this query in Apache Hadoop.

Machine Learning in Apache Spark Environment for Diagnosis of Diabetes

10.20944/preprints202111.0200.v1 ◽

2021 ◽

Author(s):

Farshid Bagheri Saravi ◽

Shadi Moghanian ◽

Giti Javidi ◽

Ehsan O Sheybani

Keyword(s):

Machine Learning ◽

Data Mining ◽

Support Vector Machine ◽

Big Data ◽

Random Forest ◽

Apache Spark ◽

Support Vector ◽

Computing Environment ◽

Large Dataset ◽

Related Data

Disease-related data and information collected by physicians, patients, and researchers seem insignificant at first glance. Still, the same unorganized data contain valuable information that is often hidden. The task of data mining techniques is to extract patterns to classify the data accurately. One of the various Data mining and its methods have been used often to diagnose various diseases. In this study, a machine learning (ML) technique based on distributed computing in the Apache Spark computing space is used to diagnose diabetics or hidden pattern of the illness to detect the disease using a large dataset in real-time. Implementation results of three ML techniques of Decision Tree (DT) technique or Random Forest (RF) or Support Vector Machine (SVM) in the Apache Spark computing environment using the Scala programming language and WEKA show that RF is more efficient and faster to diagnose diabetes in big data.

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Big Data and Cognitive Computing ◽

10.3390/bdcc5040065 ◽

2021 ◽

Vol 5 (4) ◽

pp. 65

Author(s):

Nasim Ahmed ◽

Andre L. C. Barczak ◽

Mohammad A. Rashid ◽

Teo Susnjak

Keyword(s):

Big Data ◽

Performance Prediction ◽

Vital Role ◽

Large Datasets ◽

Apache Spark ◽

Planning System ◽

Hadoop Cluster ◽

Experimental Findings ◽

Simple Equations ◽

Efficiency And Reliability

Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and configuration of so many parameters is challenging task when determining the suitable parameters for the system. In this paper, we proposed two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allows us to predict the runtime when the size of the job and the number of executables are known. The proposed models were evaluated based on five HiBench workloads, Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. The workload’s empirical data were fitted with one of the two models meeting the accuracy requirements. Finally, the experimental findings show that the model can be a handy and helpful tool for scheduling and planning system deployment.

Big Data Performance in Private Clouds. Some Initial Findings on Apache Spark Clusters Deployed in OpenStack

10.1109/roedunet54112.2021.9638296 ◽

2021 ◽

Author(s):

Marin Fotache ◽

Marius-Iulian Cluci

Keyword(s):

Big Data ◽

Apache Spark

apache spark
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Applying Apache Spark on Streaming Big Data for Health Status Prediction

Halvade somatic: Somatic variant calling with Apache Spark

Real-Time Big Data Analysis Using Web Scraping in Apache Spark Environment: Case Study—Mobile Data Analysis from Flipkart

A Secured Healthcare Management and Service Retrieval for Society Over Apache Spark Hadoop Environment

Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

MBTI personality classification using Apache Spark

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

Machine Learning in Apache Spark Environment for Diagnosis of Diabetes

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Big Data Performance in Private Clouds. Some Initial Findings on Apache Spark Clusters Deployed in OpenStack

Export Citation Format

apache sparkRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Applying Apache Spark on Streaming Big Data for Health Status Prediction

Halvade somatic: Somatic variant calling with Apache Spark

Real-Time Big Data Analysis Using Web Scraping in Apache Spark Environment: Case Study—Mobile Data Analysis from Flipkart

A Secured Healthcare Management and Service Retrieval for Society Over Apache Spark Hadoop Environment

Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

MBTI personality classification using Apache Spark

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

Machine Learning in Apache Spark Environment for Diagnosis of Diabetes

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Big Data Performance in Private Clouds. Some Initial Findings on Apache Spark Clusters Deployed in OpenStack

apache spark
Recently Published Documents