The computer failure data repository

Bianca Schroeder, Garth A. Gibson

Workshop on Reliability Analysis of System Failure Data, Cambridge, UK, March 2007

 

Abstract

System reliability is a major challenge in system design. Unreliable systems are not only a major source of user frustration, they are also expensive. The cost of avoiding downtime and the cost of actual downtime together make up more than 40% of the total cost of ownership for modern IT systems. Unfortunately, with the large component count in today’s large-scale systems, failures are quickly becoming the norm rather than the exception.

This submission describes an effort currently underway at CMU to create a public Computer Failure Data Repository (CFDR), sponsored by USENIX. The goal of the repository is to accelerate research on system reliability by filling the nearly empty collection of public data with detailed failure data from a variety of large production systems. Below we give a brief overview of the data sets we have collected so far, and discuss our ongoing efforts and the long-term goals of the CFDR.

 
