No. 35 (211), issue 6
Pages 91 - 103
ON PROGRAM RESTORATION FROM CHECKPOINTS SET
A.Y. Polyakov
This paper describes two approaches to the problem of restoring distributed programs from a set of checkpoints. A computation-node-wide algorithm is proposed for recreating parent-child relationships and process group/session assignments at restore time. A coordinated algorithm is also designed for restoring a set of processes from several nodes/terminals. The described algorithms are implemented in the checkpointing package DMTCP (Distributed MultiThreaded CheckPointing).
Keywords
- HPC, rollback-recovery, checkpointing, fault tolerance