The ability to tolerate failures while effectively exploiting the grid computing resources in an scalable and transparent manner must be an integral part of grid computing infrastructure. Hence, fault-detection service is a necessary prerequisite to fault tolerance and fault recovery in grid computing. To this end, we present an scalable fault detection service architecture. The proposed fault-detection system provides services that monitors user applications, grid middlewares and the dynamically changing state of a collection of distributed resources. It reports summaries of this information to the appropriate agents on demand or instantaneously in the event of failures.
Field of Research
080307 Operating Systems
Socio Economic Objective
970108 Expanding Knowledge in the Information and Computing Sciences