On Scalability of Distributed Machine Learning with Big Data on Apache Spark

Iktishaf+: A Big Data Tool with Automatic Labeling for Road Traffic Social Sensing and Event Detection Using Distributed Machine Learning

Sensors ◽

10.3390/s21092993 ◽

2021 ◽

Vol 21 (9) ◽

pp. 2993

Author(s):

Ebtesam Alomari ◽

Iyad Katib ◽

Aiiad Albeshri ◽

Tan Yigitcanlar ◽

Rashid Mehmood

Keyword(s):

Machine Learning ◽

Social Media ◽

Saudi Arabia ◽

Big Data ◽

Prior Knowledge ◽

Event Detection ◽

Road Traffic ◽

Arabic Language ◽

Apache Spark ◽

Distributed Machine Learning

Digital societies could be characterized by their increasing desire to express themselves and interact with others. This is being realized through digital platforms such as social media that have increasingly become convenient and inexpensive sensors compared to physical sensors in many sectors of smart societies. One such major sector is road transportation, which is the backbone of modern economies and costs globally 1.25 million deaths and 50 million human injuries annually. The cutting-edge on big data-enabled social media analytics for transportation-related studies is limited. This paper brings a range of technologies together to detect road traffic-related events using big data and distributed machine learning. The most specific contribution of this research is an automatic labelling method for machine learning-based traffic-related event detection from Twitter data in the Arabic language. The proposed method has been implemented in a software tool called Iktishaf+ (an Arabic word meaning discovery) that is able to detect traffic events automatically from tweets in the Arabic language using distributed machine learning over Apache Spark. The tool is built using nine components and a range of technologies including Apache Spark, Parquet, and MongoDB. Iktishaf+ uses a light stemmer for the Arabic language developed by us. We also use in this work a location extractor developed by us that allows us to extract and visualize spatio-temporal information about the detected events. The specific data used in this work comprises 33.5 million tweets collected from Saudi Arabia using the Twitter API. Using support vector machines, naïve Bayes, and logistic regression-based classifiers, we are able to detect and validate several real events in Saudi Arabia without prior knowledge, including a fire in Jeddah, rains in Makkah, and an accident in Riyadh. The findings show the effectiveness of Twitter media in detecting important events with no prior knowledge about them.

Download Full-text

Big data Predictive Analytics for Apache Spark using Machine Learning

2020 Global Conference on Wireless and Optical Technologies (GCWOT) ◽

10.1109/gcwot49901.2020.9391620 ◽

2020 ◽

Author(s):

Muhammad Junaid ◽

Shiraz Ali Wagan ◽

Nawab Muhammad Faseeh Qureshi ◽

Choon Sung Nam ◽

Dong Ryeol Shin

Keyword(s):

Machine Learning ◽

Big Data ◽

Predictive Analytics ◽

Apache Spark

Download Full-text

Parallel and Distributed Machine Learning Algorithms for Scalable Big Data Analytics

Future Generation Computer Systems ◽

10.1016/j.future.2019.07.009 ◽

2020 ◽

Vol 108 ◽

pp. 1159-1161

Author(s):

Henri Bal ◽

Arindam Pal

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Analytics ◽

Learning Algorithms ◽

Big Data Analytics ◽

Machine Learning Algorithms ◽

Distributed Machine Learning

Download Full-text

Big data machine learning using apache spark MLlib

2017 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2017.8258338 ◽

2017 ◽

Cited By ~ 20

Author(s):

Mehdi Assefi ◽

Ehsun Behravesh ◽

Guangchi Liu ◽

Ahmad P. Tafti

Keyword(s):

Machine Learning ◽

Big Data ◽

Apache Spark

Download Full-text

Distributed Machine Learning in Big Data Era for Smart City

From Internet of Things to Smart Cities ◽

10.1201/9781315154503-6 ◽

2017 ◽

pp. 151-177

Author(s):

Yuan. Zuo ◽

Yulei. Wu ◽

Geyong. Min ◽

Chengqiang. Huang ◽

Xing. Zhang

Keyword(s):

Machine Learning ◽

Big Data ◽

Smart City ◽

Distributed Machine Learning

Download Full-text

Comparative Analysis of Machine Learning Techniques to Identify Churn for Telecom Data

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.34.19210 ◽

2018 ◽

Vol 7 (3.34) ◽

pp. 291

Author(s):

M Malleswari ◽

R.J Manira ◽

Praveen Kumar ◽

Murugan .

Keyword(s):

Machine Learning ◽

Big Data ◽

Random Forest ◽

Decision Tree ◽

Apache Spark ◽

Machine Learning Techniques ◽

Churn Prediction ◽

Learning Techniques ◽

Boosted Tree ◽

Customer Attrition

Big data analytics has been the focus for large scale data processing. Machine learning and Big data has great future in prediction. Churn prediction is one of the sub domain of big data. Preventing customer attrition especially in telecom is the advantage of churn prediction. Churn prediction is a day-to-day affair involving millions. So a solution to prevent customer attrition can save a lot. This paper propose to do comparison of three machine learning techniques Decision tree algorithm, Random Forest algorithm and Gradient Boosted tree algorithm using Apache Spark. Apache Spark is a data processing engine used in big data which provides in-memory processing so that the processing speed is higher. The analysis is made by extracting the features of the data set and training the model. Scala is a programming language that combines both object oriented and functional programming and so a powerful programming language. The analysis is implemented using Apache Spark and modelling is done using scala ML. The accuracy of Decision tree model came out as 86%, Random Forest model is 87% and Gradient Boosted tree is 85%.

Download Full-text

PNNCP- Parallel Nearest Neighbor Classification and Prediction for Big Data Application Based on Apache Spark and Machine Learning

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1382.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 2358-2365

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Streams ◽

Nearest Neighbor ◽

Apache Spark ◽

Course Of Action ◽

Huge Data ◽

Data Application ◽

Big Data Application ◽

Neighbor Classification

Right by and by the Colossal Information applications, for case, social orchestrating, helpful human administrations, agribusiness, keeping cash, stock show, direction, Facebook and so forward are making the data with especially tall speed. Volume and Speed of the Immense data plays a fundamental bit interior the execution of Colossal data applications. Execution of the Colossal data application can be affected by distinctive parameters. Quickly watch, capacity and precision are the a significant parcel of the triumphant parameters which impact the by and gigantic execution of any Huge data applications. Due the energize and underhanded affiliation of the qualities of 7Vs of Colossal data, each Colossal Information affiliations expect the tall execution.Tall execution is the foremost obvious test within the display advancing condition. In this paper we propose the parallel course of action way to bargain with speedup the explore for closest neighbor center. k-NN classifier is the preeminent basic and comprehensively utilized method for gathering. In this paper we apply a parallelism thought to k-NN for looking the another closest neighbor. This neighbor center will be utilized for putting lost and execution of the remarkable data streams. This classifier unequivocally overhaul and coordinate of the out of date data streams. We are utilizing the Apache Begin and scattered estimation space affiliation for snappier evaluation.

Download Full-text

Machine Learning in Apache Spark Environment for Diagnosis of Diabetes

10.20944/preprints202111.0200.v1 ◽

2021 ◽

Author(s):

Farshid Bagheri Saravi ◽

Shadi Moghanian ◽

Giti Javidi ◽

Ehsan O Sheybani

Keyword(s):

Machine Learning ◽

Data Mining ◽

Support Vector Machine ◽

Big Data ◽

Random Forest ◽

Apache Spark ◽

Support Vector ◽

Computing Environment ◽

Large Dataset ◽

Related Data

Disease-related data and information collected by physicians, patients, and researchers seem insignificant at first glance. Still, the same unorganized data contain valuable information that is often hidden. The task of data mining techniques is to extract patterns to classify the data accurately. One of the various Data mining and its methods have been used often to diagnose various diseases. In this study, a machine learning (ML) technique based on distributed computing in the Apache Spark computing space is used to diagnose diabetics or hidden pattern of the illness to detect the disease using a large dataset in real-time. Implementation results of three ML techniques of Decision Tree (DT) technique or Random Forest (RF) or Support Vector Machine (SVM) in the Apache Spark computing environment using the Scala programming language and WEKA show that RF is more efficient and faster to diagnose diabetes in big data.

Download Full-text

Effective Selection of Machine Learning Algorithms for Big Data Analytics Using Apache Spark

Advances in Intelligent Systems and Computing - Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016 ◽

10.1007/978-3-319-48308-5_66 ◽

2016 ◽

pp. 692-704 ◽

Cited By ~ 4

Author(s):

Manar Mohamed Hafez ◽

Mohamed Elemam Shehab ◽

Essam El Fakharany ◽

Abd El Ftah Abdel Ghfar Hegazy

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Analytics ◽

Learning Algorithms ◽

Big Data Analytics ◽

Machine Learning Algorithms ◽

Apache Spark ◽

Effective Selection ◽

Selection Of

Download Full-text

Understanding and optimizing the performance of distributed machine learning applications on apache spark

2017 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2017.8257942 ◽

2017 ◽

Cited By ~ 2

Author(s):

Celestine Dunner ◽

Thomas Parnell ◽

Kubilay Atasu ◽

Manolis Sifalakis ◽

Haralampos Pozidis

Keyword(s):

Machine Learning ◽

Apache Spark ◽

Machine Learning Applications ◽

Distributed Machine Learning

Download Full-text