Proteus: Distributed machine learning task scheduling based on Lyapunov optimization

2021 ◽

Vol 39 (3) ◽

pp. 529-538

Author(s):

Ziyu Zhu ◽

Xiaochun Tang ◽

Quan Zhao

Keyword(s):

Machine Learning ◽

Task Scheduling ◽

Gpu Computing ◽

Computing Power ◽

Hybrid Cluster ◽

Efficiency Of Algorithms ◽

Machine Learning Applications ◽

Gpu Clusters ◽

The Difference ◽

Distributed Machine Learning

With the widespread using of GPU hardware facilities, more and more distributed machine learning applications have begun to use CPU-GPU hybrid cluster resources to improve the efficiency of algorithms. However, the existing distributed machine learning scheduling framework either only considers task scheduling on CPU resources or only considers task scheduling on GPU resources. Even considering the difference between CPU and GPU resources, it is difficult to improve the resource usage of the entire system. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to efficiently schedule tasks in the job. In the full paper, we propose a CPU-GPU hybrid cluster schedule framework in detail. First, according to the different characteristics of the computing power of the CPU and the computing power of the GPU, the data is divided into data fragments of different sizes to adapt to CPU and GPU computing resources. Second, the paper introduces the task scheduling method under the CPU-GPU hybrid. Finally, the proposed method is verified at the end of the paper. After our verification for K-Means, using the CPU-GPU hybrid computing framework can increase the performance of K-Means by about 1.5 times. As the number of GPUs increases, the performance of K-Means can be significantly improved.

Download Full-text

Joint Data Collection and Resource Allocation for Distributed Machine Learning at the Edge

IEEE Transactions on Mobile Computing ◽

10.1109/tmc.2020.3045436 ◽

2020 ◽

pp. 1-1

Author(s):

Min Chen ◽

Haichuan Wang ◽

Zeyu Meng ◽

Hongli Xu ◽

Yang Xu ◽

...

Keyword(s):

Machine Learning ◽

Resource Allocation ◽

Data Collection ◽

Distributed Machine Learning

Download Full-text

PODC 2020 Review

ACM SIGACT News ◽

10.1145/3444815.3444827 ◽

2021 ◽

Vol 51 (4) ◽

pp. 75-81

Author(s):

Ahad Mirza Baig ◽

Alkida Balliu ◽

Peter Davies ◽

Michal Dory

Keyword(s):

Machine Learning ◽

Distributed Computing ◽

Keynote Speaker ◽

Lively Discussion ◽

Theoretical Understanding ◽

New Directions ◽

New Ideas ◽

New Challenges ◽

The Impact ◽

Distributed Machine Learning

Rachid Guerraoui was the rst keynote speaker, and he got things o to a great start by discussing the broad relevance of the research done in our community relative to both industry and academia. He rst argued that, in some sense, the fact that distributed computing is so pervasive nowadays could end up sti ing progress in our community by inducing people to work on marginal problems, and becoming isolated. His rst suggestion was to try to understand and incorporate new ideas coming from applied elds into our research, and argued that this has been historically very successful. He illustrated this point via the distributed payment problem, which appears in the context of blockchains, in particular Bitcoin, but then turned out to be very theoretically interesting; furthermore, the theoretical understanding of the problem inspired new practical protocols. He then went further to discuss new directions in distributed computing, such as the COVID tracing problem, and new challenges in Byzantine-resilient distributed machine learning. Another source of innovation Rachid suggested was hardware innovations, which he illustrated with work studying the impact of RDMA-based primitives on fundamental problems in distributed computing. The talk concluded with a very lively discussion.

Download Full-text

MODES: model-based optimization on distributed embedded systems

Machine Learning ◽

10.1007/s10994-021-06014-6 ◽

2021 ◽

Author(s):

Junjie Shi ◽

Jiang Bian ◽

Jakob Richter ◽

Kuan-Hsun Chen ◽

Jörg Rahnenführer ◽

...

Keyword(s):

Machine Learning ◽

Embedded Systems ◽

Learning Model ◽

Black Box ◽

Distributed Embedded Systems ◽

Data Set ◽

Individual Model ◽

Model Based ◽

Machine Learning Model ◽

Distributed Machine Learning

AbstractThe predictive performance of a machine learning model highly depends on the corresponding hyper-parameter setting. Hence, hyper-parameter tuning is often indispensable. Normally such tuning requires the dedicated machine learning model to be trained and evaluated on centralized data to obtain a performance estimate. However, in a distributed machine learning scenario, it is not always possible to collect all the data from all nodes due to privacy concerns or storage limitations. Moreover, if data has to be transferred through low bandwidth connections it reduces the time available for tuning. Model-Based Optimization (MBO) is one state-of-the-art method for tuning hyper-parameters but the application on distributed machine learning models or federated learning lacks research. This work proposes a framework $$\textit{MODES}$$ MODES that allows to deploy MBO on resource-constrained distributed embedded systems. Each node trains an individual model based on its local data. The goal is to optimize the combined prediction accuracy. The presented framework offers two optimization modes: (1) $$\textit{MODES}$$ MODES -B considers the whole ensemble as a single black box and optimizes the hyper-parameters of each individual model jointly, and (2) $$\textit{MODES}$$ MODES -I considers all models as clones of the same black box which allows it to efficiently parallelize the optimization in a distributed setting. We evaluate $$\textit{MODES}$$ MODES by conducting experiments on the optimization for the hyper-parameters of a random forest and a multi-layer perceptron. The experimental results demonstrate that, with an improvement in terms of mean accuracy ($$\textit{MODES}$$ MODES -B), run-time efficiency ($$\textit{MODES}$$ MODES -I), and statistical stability for both modes, $$\textit{MODES}$$ MODES outperforms the baseline, i.e., carry out tuning with MBO on each node individually with its local sub-data set.

Download Full-text

Minimizing Training Time of Distributed Machine Learning by Reducing Data Communication

IEEE Transactions on Network Science and Engineering ◽

10.1109/tnse.2021.3073897 ◽

2021 ◽

pp. 1-1

Author(s):

Yubin Duan ◽

Ning Wang ◽

Jie Wu

Keyword(s):

Machine Learning ◽

Data Communication ◽

Training Time ◽

Distributed Machine Learning

Download Full-text

Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks

Nature Communications ◽

10.1038/s41467-021-23103-1 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Abdulkadir Canatar ◽

Blake Bordelon ◽

Cengiz Pehlevan

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Kernel Regression ◽

Learning Task ◽

Learning Curves ◽

Generalization Error ◽

Theoretical Understanding ◽

Classical Statistics ◽

Deep Networks ◽

Model Alignment

AbstractA theoretical understanding of generalization remains an open problem for many machine learning models, including deep networks where overparameterization leads to better performance, contradicting the conventional wisdom from classical statistics. Here, we investigate generalization error for kernel regression, which, besides being a popular machine learning method, also describes certain infinitely overparameterized neural networks. We use techniques from statistical mechanics to derive an analytical expression for generalization error applicable to any kernel and data distribution. We present applications of our theory to real and synthetic datasets, and for many kernels including those that arise from training deep networks in the infinite-width limit. We elucidate an inductive bias of kernel regression to explain data with simple functions, characterize whether a kernel is compatible with a learning task, and show that more data may impair generalization when noisy or not expressible by the kernel, leading to non-monotonic learning curves with possibly many peaks.

Download Full-text