Distributed Repair Timers.

T. Herman, University of Iowa, Department of Computer Science, Technical Report TR 98-05:

PostScript document and DVI file.


Abstract

Certain types of system faults, notably data errors, can be repaired by software. Repair consists of locating faulty components and then rewriting data to correct the errors. Faults should first be identified to initiate repair; however, if fault identification is imprecise, optimism can be a reasonable heuristic. A timer can be a useful ingredient in an optimistic repair procedure: a timer can force termination of an inaccurate repair and can also delay installation of repaired values until accuracy is verified.
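The two roles of the timer described above can be illustrated with a minimal single-process sketch. It is not the algorithm of the report, which targets a distributed, self-stabilizing setting. The class name `OptimisticRepair`, the `is_legitimate` predicate, and the tick-based `timeout` are all hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class OptimisticRepair:
    """Toy model (an assumption, not the report's algorithm): a repair is
    proposed optimistically and held back until a timer expires."""
    state: Dict[str, int]                            # currently installed values
    is_legitimate: Callable[[Dict[str, int]], bool]  # hypothetical correctness check
    timeout: int                                     # ticks to wait before deciding
    pending: Optional[Dict[str, int]] = None         # repaired values not yet installed
    clock: int = 0                                   # ticks since the repair was proposed

    def propose(self, repaired: Dict[str, int]) -> None:
        """Start an optimistic repair and restart the timer."""
        self.pending = dict(repaired)
        self.clock = 0

    def tick(self) -> str:
        """Advance the timer by one step and report what happened."""
        if self.pending is None:
            return "idle"
        self.clock += 1
        if self.clock < self.timeout:
            return "waiting"      # installation is delayed while the timer runs
        candidate = {**self.state, **self.pending}
        self.pending = None       # the timer forces the repair to terminate
        if self.is_legitimate(candidate):
            self.state = candidate
            return "installed"    # verified, so the repaired values are installed
        return "abandoned"        # inaccurate repair is dropped, state unchanged


repair = OptimisticRepair(state={"x": -1},
                          is_legitimate=lambda s: s["x"] >= 0,
                          timeout=2)
repair.propose({"x": 5})
print(repair.tick(), repair.state)
print(repair.tick(), repair.state)
```

In this sketch the timeout bounds how long any repair attempt can remain outstanding, and the installed state changes only when the check succeeds at expiry. A distributed version would additionally need the timers of different processes to agree, which is the concern of the requirements the report develops.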

This paper proposes requirements for repair timers, which are closely related to phase synchronizers (used to control progress of distributed algorithms). Faults are considered to be spontaneous corruptions of state information (transient faults), and repairs consist of assigning data values to obtain a correct system state. The repair timer requirements are developed for a distributed system model and take into account global and local needs for the use of the timer by distributed repair tasks. The requirements specify constraints on how the repair timer self-stabilizes from faults. Following presentation of the requirements, the paper presents an algorithm for the repair timer.