Abstract
Several million execution flows will run concurrently in ultrascale computing systems (UCS), and the task of understanding their coherency, for the programmer, and of coordinating them, for the runtime, is unfathomable. Moreover, given the large scale of UCS and its impact on reliability, the current static point of view is no longer sufficient. A runtime cannot consider restarting an application because of the failure of a single node, as statistically several nodes will fail every day. Classical management of these failures by programmers using checkpoint-restart is also too limited because of its overhead at such a scale. This article explores the programming models and runtimes required to facilitate scaling and extracting performance on continuously evolving platforms, while providing resilience and fault-tolerance mechanisms to tackle the increasing probability of failures throughout the whole software stack.
Original language | English |
---|---|
Title of host publication | Ultrascale Computing Systems |
Publisher | Institution of Engineering and Technology |
Pages | 9-64 |
Number of pages | 56 |
ISBN (Electronic) | 9781785618345 |
ISBN (Print) | 9781785618338 |
Publication status | Published - 1 Jan 2019 |
Keywords
- Checkpoint restart
- Checkpointing
- Distributed programming
- Failure management
- Fault-tolerant mechanisms
- Programming models
- Runtimes
- Software fault tolerance
- Software stack
- Ultrascale computing systems