Template based parallel checkpointing in a massively parallel computer system

United States Patent

Issued: January 13, 2009
A method and apparatus for a template-based parallel checkpoint save in a massively parallel supercomputer system, using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that was produced earlier and resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, giving faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster and connected by a high-speed interconnect that can perform broadcast communication. The checkpoint contains a set of small data blocks, with their corresponding checksums, from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
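The abstract does not spell out the comparison step, so the following is only a minimal sketch under stated assumptions, not the patented method. One node's checkpoint image is split into fixed-size blocks (the 4-byte BLOCK_SIZE is hypothetical). Each block is checksummed with SHA-1 (an assumed choice), and a block is stored, zlib-compressed, only when its checksum differs from the template's block at the same position. The helper names block_checksums, make_delta, and restore are illustrative. Unlike real rsync, the sketch does not use a rolling checksum to find matches at shifted offsets. The broadcast of the template checksums to the nodes is modeled simply as a function argument.

```python
import hashlib
import zlib
from typing import List, Optional, Tuple

BLOCK_SIZE = 4  # hypothetical block size, tiny so the demo is readable

# Per block: (index, checksum, compressed payload or None if it matches the template)
Delta = List[Tuple[int, bytes, Optional[bytes]]]


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Checksum every fixed-size block of a (template) checkpoint image."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def make_delta(node_data: bytes, template_sums: List[bytes],
               block_size: int = BLOCK_SIZE) -> Delta:
    """Compare a node's image block by block against the template checksums."""
    delta: Delta = []
    for idx, start in enumerate(range(0, len(node_data), block_size)):
        block = node_data[start:start + block_size]
        digest = hashlib.sha1(block).digest()
        if idx < len(template_sums) and template_sums[idx] == digest:
            # Same as the template block: record only the checksum.
            delta.append((idx, digest, None))
        else:
            # Changed or extra block: store it with lossless compression.
            delta.append((idx, digest, zlib.compress(block)))
    return delta


def restore(delta: Delta, template: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the node image from the delta plus the template (hash collisions ignored)."""
    out = bytearray()
    for idx, _digest, payload in delta:
        if payload is None:
            out += template[idx * block_size:(idx + 1) * block_size]
        else:
            out += zlib.decompress(payload)
    return bytes(out)


template = b"AAAABBBBCCCCDDDD"
node = b"AAAABBBBXXXXDDDD"
delta = make_delta(node, block_checksums(template))
print([i for i, _, p in delta if p is not None])  # [2]
assert restore(delta, template) == node
```

Only block 2 carries data here. The other three are represented by their checksums alone, which is where the savings described in the abstract would come from when most of a node's memory image matches the template.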
Inventors: Archer; Charles Jens (Rochester, MN), Inglett; Todd Alan (Rochester, MN)
Assignee: International Business Machines Corporation (Armonk, NY)
Appl. No.: 11/106,010
Filed: April 14, 2005
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with Government support under Contract No. B519700 awarded by the Department of Energy. The Government has certain rights in this invention.