To control an internally or externally provided IT service, service level agreements (SLAs) are formulated between the service provider and the consumer. Among other information, SLAs include guarantees for quality metrics to ensure fast, safe, and secure IT service provision. One of the most crucial quality metrics for consumers is service availability.
The IT Infrastructure Library (ITIL) defines availability as “the ability of a service […] to perform its agreed function when required”; thus, this metric is at “the core of customer satisfaction” for IT service providers. If service availability falls below the agreed level, penalty costs arise for the provider in the short term, but the loss of reputation may be harder to deal with in the long term. Hence, it is crucial for IT service providers to ensure sufficient availability levels, at minimum cost.
Four principal approaches can be identified to increase availability.
The first two of these approaches may be considered in the service design stage. However, the success of fault prevention is limited because faults can never be excluded completely. Therefore, fault tolerance mechanisms are a powerful but expensive approach to increasing availability.
To trade off the costs and effects of redundancy mechanisms, a redundancy allocation problem (RAP) can be defined. For this combinatorial optimization problem, a service is divided into the subsystems in which redundancy can be applied. Using a combination of meta-heuristics and simulation, combinations of redundancy mechanisms can be identified for which availability and costs are nearly optimal. Following a Pareto approach, the results can be visualized as a Pareto front that represents the trade-off between both objectives, as can be seen in this figure. On this basis, a decision maker in the service design stage can choose a suitable redundancy design from the points depicted.
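To make the idea of a Pareto front over redundancy designs concrete, the following is a minimal sketch and not the method described above: it replaces meta-heuristics and simulation with exhaustive enumeration and a closed-form availability formula. It assumes that subsystems are arranged in series, that redundant components within a subsystem are identical, independent, and active in parallel, and that cost grows linearly with the number of components. The subsystem names, availabilities, costs, and the names `SUBSYSTEMS`, `MAX_REDUNDANCY`, `evaluate`, and `pareto_front` are all hypothetical.

```python
from itertools import product

# Hypothetical subsystems: (name, availability of one component, cost of one component).
SUBSYSTEMS = [
    ("web", 0.95, 2.0),
    ("app", 0.90, 3.0),
    ("db", 0.98, 5.0),
]
MAX_REDUNDANCY = 3  # assumed upper bound on parallel components per subsystem


def evaluate(design):
    """Return (availability, cost) for a design, i.e. one component count per subsystem."""
    availability = 1.0
    cost = 0.0
    for (_, a, c), n in zip(SUBSYSTEMS, design):
        # A subsystem fails only if all n independent components fail.
        availability *= 1.0 - (1.0 - a) ** n
        cost += c * n
    return availability, cost


def pareto_front(results):
    """Keep designs that no other design beats on both availability and cost."""
    front = []
    for design, (avail, cost) in results.items():
        dominated = any(
            a2 >= avail and c2 <= cost and (a2 > avail or c2 < cost)
            for a2, c2 in results.values()
        )
        if not dominated:
            front.append((design, avail, cost))
    return sorted(front, key=lambda t: t[2])  # cheapest design first


# Enumerate every design; feasible here only because the example is tiny.
designs = product(range(1, MAX_REDUNDANCY + 1), repeat=len(SUBSYSTEMS))
results = {d: evaluate(d) for d in designs}
front = pareto_front(results)

for design, avail, cost in front:
    print(design, round(avail, 5), cost)
```

Each printed row is one non-dominated design: moving down the list, availability can only be raised by accepting a higher cost. For realistic services the design space is too large to enumerate and component failures are rarely independent, which is where the combination of meta-heuristics and simulation mentioned above comes in.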