According to an aspect, a method for triggering creation of a checkpoint in a computer system includes executing a task in a processing node of the computer system and determining whether it is time to read a monitor associated with a metric of the task. The monitor is read to determine a value of the metric based on determining that it is time to read the monitor. A threshold for triggering creation of the checkpoint is determined based on the value of the metric. Based on determining that the value of the metric has crossed the threshold, the checkpoint including state data of the task is created to enable restarting execution of the task upon a restart operation.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
This invention was made with Government support under contract number B599858 awarded by the Department of Energy. The Government has certain rights in this invention.