Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions
2018, Vol. 33 (2), pp. 366-383
We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them with state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance, resilience against silent data corruption (SDC), performance, and scalability. We propose new gossip-based reduction algorithms that significantly improve on the state of the art in resilience against SDC. Moreover, we propose a new gossip-inspired reduction algorithm that promises much more competitive runtime performance in an HPC context than classical gossip-based algorithms, particularly when accuracy requirements are low.
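To make the idea of a gossip-based reduction concrete, here is a minimal Python sketch of the classical push-sum averaging protocol. It is not one of the new algorithms proposed in this work; it only illustrates the randomized, iterative style of reduction that gossip-based algorithms share. The simulation is sequential and synchronous (all nodes act in lockstep rounds), assumes any node can contact any other node, and the function name `push_sum` and its parameters are hypothetical.

```python
import random


def push_sum(values, rounds, seed=0):
    """Simulate synchronous push-sum gossip averaging (illustrative sketch).

    Each node i holds a pair (s_i, w_i), initialised to (x_i, 1).
    In every round each node keeps half of its pair and pushes the
    other half to one node chosen uniformly at random (possibly itself).
    The ratio s_i / w_i is node i's current estimate of the global average.
    """
    rng = random.Random(seed)  # seeded for reproducible simulations
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            half_s, half_w = s[i] / 2, w[i] / 2
            # Keep one half locally, so w_i never drops to zero.
            new_s[i] += half_s
            new_w[i] += half_w
            # Push the other half to a uniformly random target node.
            j = rng.randrange(n)
            new_s[j] += half_s
            new_w[j] += half_w
        s, w = new_s, new_w
    return [si / wi for si, wi in zip(s, w)]


if __name__ == "__main__":
    data = [float(i) for i in range(16)]
    true_mean = sum(data) / len(data)
    # Fewer rounds give cheaper but less accurate estimates.
    for r in (2, 5, 10, 30):
        est = push_sum(data, rounds=r)
        print(r, max(abs(e - true_mean) for e in est))
```

Every round conserves the totals of `s` and `w`, which is why the per-node ratios converge to the true average rather than to some other value. Stopping after fewer rounds trades accuracy for fewer communication steps, which is the low-accuracy regime in which the abstract expects gossip-inspired reduction to be most competitive.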
2017, pp. 494-504
2017, Vol. 32 (5), pp. 627-640