Understanding and optimizing the performance of distributed machine learning applications on Apache Spark

With the widespread use of GPU hardware, more and more distributed machine learning applications have begun to use CPU-GPU hybrid cluster resources to improve algorithm efficiency. However, existing distributed machine learning scheduling frameworks consider task scheduling either only on CPU resources or only on GPU resources. Even when they account for the differences between CPU and GPU resources, it is difficult to improve the resource usage of the entire system. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to schedule the tasks in a job efficiently. In the full paper, we propose a CPU-GPU hybrid cluster scheduling framework in detail. First, according to the different computing power of CPUs and GPUs, the data is divided into fragments of different sizes to suit CPU and GPU computing resources. Second, the paper introduces the task scheduling method for the CPU-GPU hybrid cluster. Finally, the proposed method is verified at the end of the paper. In our verification with K-Means, the CPU-GPU hybrid computing framework increases the performance of K-Means by about 1.5 times. As the number of GPUs increases, the performance of K-Means improves significantly.

Model averaging in distributed machine learning: a case study with Apache Spark

The VLDB Journal ◽ 10.1007/s00778-021-00664-7 ◽ 2021

Author(s): Yunyan Guo ◽ Zhipeng Zhang ◽ Jiawei Jiang ◽ Wentao Wu ◽ Ce Zhang ◽ ...

Keyword(s): Machine Learning ◽ Model Averaging ◽ Apache Spark ◽ Distributed Machine Learning

Predicting Diabetes using Distributed Machine Learning based on Apache Spark*

2020 International Conference on Innovative Trends in Communication and Computer Engineering (ITCE) ◽ 10.1109/itce48509.2020.9047795 ◽ 2020

Author(s): Hager Ahmed ◽ Eman M.G. Younis ◽ Abdelmgeid A. Ali

Keyword(s): Machine Learning ◽ Apache Spark ◽ Distributed Machine Learning

Iktishaf+: A Big Data Tool with Automatic Labeling for Road Traffic Social Sensing and Event Detection Using Distributed Machine Learning

Sensors ◽ 10.3390/s21092993 ◽ 2021 ◽ Vol 21 (9) ◽ pp. 2993

Author(s): Ebtesam Alomari ◽ Iyad Katib ◽ Aiiad Albeshri ◽ Tan Yigitcanlar ◽ Rashid Mehmood

Keyword(s): Machine Learning ◽ Social Media ◽ Saudi Arabia ◽ Big Data ◽ Prior Knowledge ◽ Event Detection ◽ Road Traffic ◽ Arabic Language ◽ Apache Spark ◽ Distributed Machine Learning

Digital societies could be characterized by their increasing desire to express themselves and interact with others. This is being realized through digital platforms such as social media, which have increasingly become convenient and inexpensive sensors compared to physical sensors in many sectors of smart societies. One such major sector is road transportation, which is the backbone of modern economies and costs 1.25 million deaths and 50 million human injuries globally each year. Cutting-edge work on big data-enabled social media analytics for transportation-related studies is limited. This paper brings a range of technologies together to detect road traffic-related events using big data and distributed machine learning. The most specific contribution of this research is an automatic labelling method for machine learning-based traffic-related event detection from Twitter data in the Arabic language. The proposed method has been implemented in a software tool called Iktishaf+ (an Arabic word meaning discovery) that is able to detect traffic events automatically from tweets in the Arabic language using distributed machine learning over Apache Spark. The tool is built using nine components and a range of technologies including Apache Spark, Parquet, and MongoDB. Iktishaf+ uses a light stemmer for the Arabic language developed by us. We also use a location extractor, developed by us, that allows us to extract and visualize spatio-temporal information about the detected events. The specific data used in this work comprises 33.5 million tweets collected from Saudi Arabia using the Twitter API. Using support vector machines, naïve Bayes, and logistic regression-based classifiers, we are able to detect and validate several real events in Saudi Arabia without prior knowledge, including a fire in Jeddah, rains in Makkah, and an accident in Riyadh. The findings show the effectiveness of Twitter media in detecting important events with no prior knowledge about them.

Machine learning applications for shock train diagnostics

AIAA Scitech 2021 Forum ◽ 10.2514/6.2021-1878 ◽ 2021

Author(s): Jared Chin ◽ Mirko Gamba

Keyword(s): Machine Learning ◽ Shock Train ◽ Machine Learning Applications

Exploring the Applications of Machine Learning in Healthcare

International Journal of Sensors Wireless Communications and Control ◽ 10.2174/2210327910666191220103417 ◽ 2020 ◽ Vol 10 (4) ◽ pp. 458-472

Author(s): Tausifa Jan Saleem ◽ Mohammad Ahsan Chishti

Keyword(s): Machine Learning ◽ Disease Risk ◽ Disease Diagnosis ◽ Machine Intelligence ◽ Healthcare Applications ◽ Comprehensive Overview ◽ Machine Learning Applications ◽ Remote Healthcare ◽ Healthcare Monitoring ◽ Applications Of Machine Learning

The rapid progress in domains like machine learning and big data has created plenty of opportunities in data-driven applications, particularly in healthcare. Incorporating machine intelligence in healthcare can result in breakthroughs like precise disease diagnosis, novel methods of treatment, remote healthcare monitoring, drug discovery, and curtailment of healthcare costs. The implementation of machine intelligence algorithms on massive healthcare datasets is computationally expensive. However, consequential progress in computational power during recent years has facilitated the deployment of machine intelligence algorithms in healthcare applications. Motivated to explore these applications, this paper presents a review of research works dedicated to the implementation of machine learning on healthcare datasets. The studies have been categorized into the following groups: (a) disease diagnosis and detection, (b) disease risk prediction, (c) health monitoring, (d) healthcare-related discoveries, and (e) epidemic outbreak prediction. The objective of the research is to help researchers in this field get a comprehensive overview of machine learning applications in healthcare. Apart from revealing the potential of machine learning in healthcare, this paper will serve as a motivation to foster advanced research in the domain of machine intelligence-driven healthcare.

Learning and control

10.1093/oso/9780199674923.003.0026 ◽ 2018

Author(s): Ivan Herreros

Keyword(s): Machine Learning ◽ Reinforcement Learning ◽ Brain Function ◽ Control Strategies ◽ Learning Problems ◽ Animal Learning ◽ Feed Forward Control ◽ Machine Learning Applications ◽ And Control

This chapter discusses basic concepts from control theory and machine learning to facilitate a formal understanding of animal learning and motor control. It first distinguishes between feedback and feed-forward control strategies, and later introduces the classification of machine learning applications into supervised, unsupervised, and reinforcement learning problems. Next, it links these concepts with their counterparts in the domain of the psychology of animal learning, highlighting the analogies between supervised learning and classical conditioning, reinforcement learning and operant conditioning, and between unsupervised and perceptual learning. Additionally, it interprets innate and acquired actions from the standpoint of feedback vs anticipatory and adaptive control. Finally, it argues how this framework of translating knowledge between formal and biological disciplines can serve us to not only structure and advance our understanding of brain function but also enrich engineering solutions at the level of robot learning and control with insights coming from biology.

Big data Predictive Analytics for Apache Spark using Machine Learning

2020 Global Conference on Wireless and Optical Technologies (GCWOT) ◽ 10.1109/gcwot49901.2020.9391620 ◽ 2020

Author(s): Muhammad Junaid ◽ Shiraz Ali Wagan ◽ Nawab Muhammad Faseeh Qureshi ◽ Choon Sung Nam ◽ Dong Ryeol Shin

Keyword(s): Machine Learning ◽ Big Data ◽ Predictive Analytics ◽ Apache Spark

Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology

Machine Learning and Knowledge Extraction ◽ 10.3390/make3020020 ◽ 2021 ◽ Vol 3 (2) ◽ pp. 392-413

Author(s): Stefan Studer ◽ Thanh Binh Bui ◽ Christian Drescher ◽ Alexander Hanuschkin ◽ Ludwig Winkler ◽ ...

Keyword(s): Machine Learning ◽ Quality Assurance ◽ Process Model ◽ Practical Experience ◽ Special Focus ◽ Close Monitoring ◽ Machine Learning Applications ◽ Project Organizations ◽ Considerable Impact ◽ Learning Development

Machine learning is an established and frequently used technique in industry and academia, but a standard process model to improve the success and efficiency of machine learning applications is still missing. Project organizations and machine learning practitioners face manifold challenges and risks when developing machine learning applications and need guidance to meet business expectations. This paper therefore proposes a process model for the development of machine learning applications, covering six phases from defining the scope to maintaining the deployed machine learning application. Business and data understanding are executed simultaneously in the first phase, as both have considerable impact on the feasibility of the project. The next phases comprise data preparation, modeling, evaluation, and deployment. Special focus is applied to the last phase, as a model running in changing real-time environments requires close monitoring and maintenance to reduce the risk of performance degradation over time. With each task of the process, this work proposes a quality assurance methodology that is suitable to address challenges in machine learning development that are identified in the form of risks. The methodology is drawn from practical experience and scientific literature, and has proven to be general and stable. The process model expands on CRISP-DM, a data mining process model that enjoys strong industry support but fails to address machine learning-specific tasks. The presented work proposes an industry- and application-neutral process model tailored for machine learning applications with a focus on technical tasks for quality assurance.

On Scalability of Distributed Machine Learning with Big Data on Apache Spark

A unified schedule policy of distributed machine learning framework for CPU-GPU cluster
