Optimizing resources to mitigate stragglers through virtualization in run time
Modern computing systems are generally enormous in scale, consisting of hundreds to thousands of heterogeneous machine nodes, to meet rising demands for Cloud services. MapReduce and other parallel computing frameworks are frequently used on such cluster architecture to offer consumers dependable and timely services. However, Cloud workloads’ complex features, such as multi-dimensional resource requirements and dynamically changing system settings, such as dynamic node performance, are posing new difficulties for providers in terms of both customer experience and system efficiency. The straggler problem occurs when a small subset of parallelized jobs takes an excessively long time to execute in contrast to their siblings, resulting in a delayed job response and the possibility of late-timing failure. Speculative execution is the state-of-the-art method to straggler mitigation. Speculative execution has been used in numerous real-world systems with a variety of implementation improvements, but the results of this thesis’ research demonstrate that it is typically wasteful. The failure rate of speculative execution might be as high as 71 percent, according to different data center production trace logs. Straggler mitigation is a difficult task in and of itself: 1) stragglers may have varying degrees of severity in parallel job execution; 2) whether a task should be considered a straggler is highly subjective, depending on various application and system conditions; 3) the efficiency of speculative execution would be improved if dynamic node quality could be adequately modeled and predicted; 4) Other sorts of stragglers, such as those generated by data skews, are beyond speculative execution’s capabilities.