Architecting Dependable Systems


The current challenge faced by system architects is how to build dependable systems from existing undependable components and/or systems, such as off-the-shelf (OTS) components and legacy systems, that were not originally designed to interact with each other. One major problem when using existing components is the inability to change, or even to access, their internal designs and implementations. Moreover, since the evolution of these components might be outside the control of the system architect, solutions that depend on a specific component implementation become unfeasible. Given these limitations, the delivery of correct service, and the justification of this ability, have to be obtained from the components' interfaces and their interactions rather than from their internal designs or implementations.

System dependability is measured through its attributes, such as reliability, availability, confidentiality, and integrity. The technologies for attaining these attributes can be grouped into four major categories [1]: rigorous design, verification and validation, fault tolerance, and system evaluation. In the following, we present how architectural level reasoning might improve these technologies [3].

Rigorous design, also known as fault prevention, is concerned with development activities that introduce rigour in the design and implementation of systems for preventing the introduction of faults or their occurrence during operation. Development methodologies and construction techniques for preventing the introduction and occurrence of faults can be described respectively from the perspective of development faults and configuration faults (a type of interaction fault) [1]. In the context of software development, the architectural representation of a software system plays a critical role in reducing the number of faults that might be introduced. One way of preventing development faults is the use of formal or rigorous notations for representing and analysing software architectures. The starting point of any development should be the architectural model of a system in which the dependability attributes of its components are clearly documented, together with the static and dynamic properties of their interfaces. Also as part of these models, assumptions should be documented about the required and provided behaviour of the components, including their failure assumptions.

One way of preventing configuration faults from occurring during system operation is to protect components and their context against potential architectural mismatches (design faults) that might exist between them. These vulnerabilities can be prevented by adding to the structure of the system architectural solutions based on integrators (more commonly known as wrappers). The assumption here is that the integrators are aware of all incompatibilities that might exist between a component and its environment.
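As an illustration of the integrator idea, the following sketch wraps a hypothetical OTS component whose interface cannot be changed. All names (`LegacySensor`, the Fahrenheit readings, the `-999` failure value) are invented for the example; the point is that the integrator absorbs the mismatch and translates the component's idiosyncratic failure signalling into a convention its environment understands.

```python
class LegacySensor:
    """Hypothetical OTS component whose interface we cannot change:
    it reports temperature in Fahrenheit and signals failure with
    the magic value -999."""
    def read(self):
        return 77.0  # stub: a real component would query hardware


class SensorIntegrator:
    """Integrator (wrapper) that adapts LegacySensor to the
    Celsius-based interface the rest of the architecture expects,
    and maps the magic failure value onto an exception."""
    FAILURE_VALUE = -999

    def __init__(self, component):
        self._component = component

    def read_celsius(self):
        raw = self._component.read()
        if raw == self.FAILURE_VALUE:
            raise RuntimeError("sensor failure")
        return (raw - 32.0) * 5.0 / 9.0


sensor = SensorIntegrator(LegacySensor())
print(sensor.read_celsius())  # 25.0
```

The environment only ever sees the adapted interface, so the architectural mismatch is confined to a single, inspectable location in the configuration.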

Verification and validation, also known as fault removal, is concerned with development and post-deployment activities that aim at reducing the number or severity of faults [1]. The role of architectural representations in the removal of faults during development is twofold: first, they allow faults to be identified and removed early in the development process, and second, they provide the basis for removing faults late in the process. The early removal of faults entails checking whether the architectural description adheres to given properties associated with particular architectural styles, and whether the architectural description is an accurate representation of the requirements specification. The late removal of faults entails checking whether the implementation fulfils the architectural specification. The role of architectural representation in the removal of faults after system deployment includes both corrective and preventive maintenance [1]. The software architecture provides a good starting point for revealing the areas a prospective change will affect.
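The early style-conformance check mentioned above can be mechanised. The sketch below checks a toy architectural description against one constraint of a strictly layered style (each connector may only go from a layer to the layer directly below it); the component names and layer assignment are illustrative, not taken from any particular system.

```python
# Architectural description: components with layer numbers, and
# connectors as (source, destination) pairs.
layers = {"ui": 2, "logic": 1, "storage": 0}
connectors = [("ui", "logic"), ("logic", "storage"), ("storage", "ui")]


def style_violations(layers, connectors):
    """Return the connectors that break the strict layering
    constraint (source must sit exactly one layer above destination)."""
    return [(src, dst) for src, dst in connectors
            if layers[src] - layers[dst] != 1]


print(style_violations(layers, connectors))  # [('storage', 'ui')]
```

Detecting the upward call from `storage` to `ui` at the level of the architectural description removes the fault before any implementation exists, which is precisely the "early removal" the text refers to.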

Fault tolerance aims to avoid system failure via error detection and system recovery at run-time [1]. Error detection at the architectural level relies on monitoring mechanisms, or probes, for detecting erroneous states at the interfaces of architectural elements or in the interactions between these elements. The aim of system recovery, on the other hand, is twofold: first, to eliminate erroneous states from the system, and second, to reconfigure the system architecture so as to isolate those architectural elements that might have caused the erroneous states. Architectural abstractions offer a number of features that are suitable for the provision of fault tolerance, including error confinement, which is the ability of a system to avoid the propagation of errors. They also provide a global perspective of the system that enables high-level interpretation of system faults, thus facilitating their identification. The separation between computation and communication, which enforces modularisation and information hiding, facilitates error detection, confinement, and system recovery. The architectural configuration, which embodies structural constraints, helps to identify anomalies in the system structure. The role of software architectures in error confinement needs to be approached from two distinct angles: on one hand, the support for fostering the creation of architectural structures that provide error confinement, and on the other, the representation and analysis of error confinement mechanisms. Explicit system structuring facilitates the introduction of mechanisms such as program assertions, pre- and post-conditions, and invariants that enable the detection of potential erroneous architectural states. Thus, having a highly cohesive system with self-checking architectural elements is essential for error confinement.
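A minimal sketch of such a self-checking architectural element follows: a probe checks an invariant on every interaction at the element's interface, so an erroneous state is detected and confined at the element boundary rather than propagating into the rest of the configuration. The buffer, its capacity, and its invariant are all illustrative.

```python
class SelfCheckingBuffer:
    """A self-checking element: every interaction at the interface is
    bracketed by a probe that checks the element's invariant."""

    def __init__(self, capacity):
        self._items = []
        self._capacity = capacity

    def _probe(self):
        # Invariant probe: an erroneous internal state is signalled
        # at the boundary instead of leaking to other elements.
        assert 0 <= len(self._items) <= self._capacity, \
            "erroneous state detected at element boundary"

    def put(self, item):
        self._probe()                        # pre-condition check
        if len(self._items) == self._capacity:
            raise OverflowError("buffer full")
        self._items.append(item)
        self._probe()                        # post-condition check


buf = SelfCheckingBuffer(capacity=1)
buf.put("request")
# A second put on the full buffer raises OverflowError, which an
# enclosing recovery mechanism can then handle.
```

Because the checks live at the interface, the rest of the architecture only needs to know the exception convention, not the element's internals.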
Architectural changes in support of fault handling during system recovery can include the addition, removal, or replacement of components and connectors, modifications to the configuration or parameters of components and connectors, and alterations in the topology of the component/connector network.
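The replacement case can be sketched as follows: a suspected-faulty component is isolated by rebinding its name in the configuration to a spare, leaving the connector topology untouched. The `Configuration` class and the component names are hypothetical.

```python
class Configuration:
    """Toy run-time representation of an architectural configuration."""

    def __init__(self):
        self.components = {}   # name -> component instance
        self.connectors = []   # (source name, destination name) pairs

    def replace(self, name, spare):
        """Fault handling by reconfiguration: swap in a spare for a
        suspected-faulty component. Connectors refer to components by
        name, so the topology is unchanged."""
        self.components[name] = spare


config = Configuration()
config.components["worker"] = "primary-worker"
config.connectors.append(("client", "worker"))

# The "worker" component is suspected faulty; isolate it.
config.replace("worker", "spare-worker")

print(config.components["worker"])   # spare-worker
print(config.connectors)             # [('client', 'worker')]
```

Indirection through names is the design choice that makes this cheap: because connectors never hold direct references to component instances, a replacement is a single rebinding rather than a rewiring of the topology.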

System evaluation, also known as fault forecasting, is performed by evaluating a system's behaviour with respect to fault occurrence or activation [1]. The architectural evaluation of a system should be carried out in terms of system failure modes, and the combinations of component and/or connector failures that would lead to system failure. Instead of a precise characterisation of a dependability attribute, the goal should be the early identification of risks to dependability based on the high-level structures of the system, and to reason about the impact that an architectural decision might have upon a dependability attribute [2]. At such an early stage of development, the actual parameters that characterise an attribute are not yet known, since they are often implementation dependent. Nevertheless, the architectural evaluation of a system can be done either qualitatively or quantitatively. Qualitative architectural evaluation aims to provide evidence of whether the architecture is suitable with respect to some goals and problematic with respect to others. This is obtained by employing questionnaires, checklists, and scenarios to investigate the way an architecture addresses its dependability requirements in the presence of failures [2]. Quantitative architectural evaluation aims to estimate, in terms of probabilities, whether the dependability attributes are satisfied. Probability estimation through modelling relies on either architectural simulation or metrics extracted from an architectural representation. In the context of dependability, most of the approaches rely on the construction of stochastic processes for modelling system components and their interactions, in terms of failures and repairs. Alternatively, fault injection techniques can be used for evaluating the dependability of the system.
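A very small instance of such a stochastic evaluation: each component is modelled as a two-state (working/failed) process whose steady-state availability is MTTF / (MTTF + MTTR), and the components are composed in series, i.e. the failure of any one of them fails the system. The component names and MTTF/MTTR figures are purely illustrative.

```python
def availability(mttf, mttr):
    """Steady-state availability of a two-state repairable component."""
    return mttf / (mttf + mttr)


# Hypothetical components with assumed MTTF/MTTR figures, in hours.
components = {
    "front-end": (900.0, 100.0),   # A = 0.90
    "database":  (950.0, 50.0),    # A = 0.95
}

# Series composition: the system is up only when every component is up.
system_availability = 1.0
for name, (mttf, mttr) in components.items():
    system_availability *= availability(mttf, mttr)

print(round(system_availability, 3))  # 0.855
```

Even this crude model supports the kind of early architectural reasoning described above: it shows, for instance, how adding a component to a series configuration can only lower system availability, which may argue for introducing redundancy instead.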


[1] A. Avizienis, J.-C. Laprie, B. Randell, C. Landwehr. “Basic Concepts and Taxonomy of Dependable and Secure Computing”. IEEE Transactions on Dependable and Secure Computing 1(1). January-March 2004. pp. 11-33.

[2] P. Clements, R. Kazman, M. Klein. Evaluating Software Architectures: Methods and Case Studies. Addison-Wesley. 2002.

[3] C. Gacek, R. de Lemos. “Architectural Description of Dependable Software Systems”. Structure for Dependability: Computer-Based Systems from an Interdisciplinary Perspective. D. Besnard, C. Gacek, C. B. Jones (Eds.). Springer-Verlag. London, UK. 2006. pp. 127-142.

Maintained by Rogério de Lemos (r.delemos[at]
Last updated 4 December 2008