CaberNet vision of RTD

5. Dependable Systems

by David Powell (LAAS-CNRS, France)


Ongoing Research / Future Directions

Dependability is defined as the property of a computer system that enables its users to place a justified reliance on the service it delivers. Dependability is a generic concept, generalizing the notions of availability, reliability, integrity, confidentiality, maintainability, safety and security. The aim of current research is to define methods for protecting and assessing systems with respect to a wide spectrum of faults that can be broadly classified into five classes [Avizienis 2001]: physical faults, non-malicious design faults, malicious design faults, non-malicious interaction faults, and malicious interaction faults (intrusions). The methods can be categorized as fault prevention, fault tolerance, fault removal and fault forecasting. Fault prevention and fault removal are sometimes considered together as constituting fault avoidance, as opposed to fault tolerance and fault forecasting, which together constitute fault acceptance.

Fault Prevention

Fault prevention aims to prevent the occurrence or the introduction of faults. It involves developing systems in such a way as to prevent the introduction of design and implementation faults, and to prevent faults from occurring during operation. In this context, any general engineering technique aimed at introducing rigor into the design process can be considered as constituting fault prevention. However, some areas currently being researched are more specific to the dependable computing community. One such area is the formal definition of security policies in order to prevent the introduction of vulnerabilities. Defining a security policy involves identifying the properties that must be satisfied and the rules that applications and organizations must obey in order to satisfy them. For example, work being carried out in the MP6 project in France is specifically aimed at defining role-based access control policies applicable to information systems in the health and social sectors.
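To make the role-based access control idea concrete, the following Python sketch shows its core: users are assigned roles, roles carry permissions, and anything not explicitly permitted is denied. It is only an illustration under simple assumptions; the class, role, action and resource names are hypothetical, and it does not reproduce the MP6 policies, which also address problems such as anonymization.

```python
from dataclasses import dataclass, field


@dataclass
class RBACPolicy:
    """Minimal role-based access control: users are assigned roles, roles carry permissions."""

    # role -> set of (action, resource) pairs that the role is permitted to perform
    role_permissions: dict = field(default_factory=dict)
    # user -> set of roles assigned to that user
    user_roles: dict = field(default_factory=dict)

    def grant(self, role: str, action: str, resource: str) -> None:
        self.role_permissions.setdefault(role, set()).add((action, resource))

    def assign(self, user: str, role: str) -> None:
        self.user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user: str, action: str, resource: str) -> bool:
        # Deny by default: access is granted only if one of the user's roles carries the permission.
        return any(
            (action, resource) in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )


# Hypothetical policy fragment for a health/social information system.
policy = RBACPolicy()
policy.grant("physician", "read", "medical_record")
policy.grant("physician", "write", "medical_record")
policy.grant("social_worker", "read", "social_file")
policy.assign("alice", "physician")
policy.assign("bob", "social_worker")

print(policy.is_allowed("alice", "write", "medical_record"))  # True
print(policy.is_allowed("bob", "read", "medical_record"))     # False
```

Denying by default means that a forgotten rule results in refused access rather than a vulnerability, which is the fault-prevention intent of a formally defined policy.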

Another area of active research into fault prevention concerns human factors issues in critical "socio-technical" systems. In particular, research initiated at the University of York on the allocation of functions between humans and machines [Dearden 2000] has served as the basis of a prototype database tool to assist communication between system designers and human factors experts [Mersiol 2002]. In the UK, the Interdisciplinary Research Collaboration in Dependability of Computer-Based Systems (DIRC) has a strong human factors component.

Fault Tolerance

Fault-tolerance techniques aim to ensure that a system fulfills its function despite faults [Arlat 1999]. Current research is centered on distributed fault-tolerance techniques (including fault-tolerance techniques for embedded systems), wrapping and reflection technologies for facilitating the implementation of fault-tolerance, and the generalization of the tolerance paradigm to include deliberately malicious faults, i.e., intrusion-tolerance.

Distributed fault-tolerance techniques aim to implement redundancy techniques using software, usually through a message-passing paradigm. As such, much of the research in the area is concerned with the definition of distributed algorithms for fault-tolerance. In closed, embedded systems the design of such algorithms may be simplified if it is possible to substantiate the strong assumptions underlying the synchronous system model. Therefore, most current research on fault-tolerance in such systems follows this approach, often using the time-triggered paradigm [Elmenreich 2002] [Powell 1999] [Powell 2001] [Steiner 2002]. Note also that, especially in embedded systems, state-of-the-art fault tolerance techniques cannot ignore that most faults experienced in real systems are transient faults [Bondavalli 2000a].
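As an illustration of the time-triggered paradigm, the sketch below assumes a static TDMA round in which each node owns one fixed transmission slot. Because the authorised sender is a function of the global time alone, every receiver can detect a frame sent outside its slot. The node names and the 250-microsecond slot length are hypothetical, and real time-triggered protocols add clock synchronisation, membership services and much more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TDMASchedule:
    """Static time-triggered schedule: every node owns one fixed slot in each round."""

    nodes: tuple  # transmission order within a round
    slot_length_us: int

    @property
    def round_length_us(self) -> int:
        return self.slot_length_us * len(self.nodes)

    def sender_at(self, global_time_us: int) -> str:
        # The authorised sender is a pure function of the (synchronised) global time,
        # so it is known a priori by every node.
        slot = (global_time_us % self.round_length_us) // self.slot_length_us
        return self.nodes[slot]

    def is_timely(self, node: str, send_time_us: int) -> bool:
        # A frame sent outside its owner's slot is a detectable timing failure.
        return self.sender_at(send_time_us) == node


# Hypothetical brake-by-wire cluster with four nodes and 250 microsecond slots.
schedule = TDMASchedule(nodes=("brake_fl", "brake_fr", "pedal", "monitor"), slot_length_us=250)
print(schedule.sender_at(1760))  # monitor: 1760 % 1000 = 760, and 760 // 250 = slot 3
```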

In many distributed systems, however, especially large-scale systems, it is difficult to substantiate the strong assumptions underlying the synchronous system model, so several teams are defining paradigms able to deal with asynchrony (see, for example, [Mostéfaoui 2001]). One approach being followed at the University of Lisbon is to consider a reliable timing channel for control signals that is separate from the asynchronous channel used to carry payload traffic [Casimiro 2002] [Veríssimo 2002]. An alternative approach is to consider a timed asynchronous model, which involves making assumptions regarding the maximum drift of hardware clocks accessible from non-faulty processes. Using this model, a European-designed fail-safe redundancy management protocol [Essamé 1999] is currently being implemented in the context of the automation of the Canarsie subway line in New York [Powell 2002].
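The timed asynchronous model relies on a bound on hardware clock drift. The following minimal sketch, assuming a drift bound rho, shows how such a bound lets a process convert durations measured on its own clock into guaranteed bounds on real time, for example to decide whether a round trip was certainly timely. The function names and values are illustrative only; this is not the redundancy management protocol of [Essamé 1999].

```python
def real_duration_bounds(local_duration: float, rho: float) -> tuple:
    """Bounds on real elapsed time for a duration measured on one local hardware clock.

    Assumes the clock drift rate is bounded by rho, i.e. for real times t1 < t2:
    (1 - rho) * (t2 - t1) <= H(t2) - H(t1) <= (1 + rho) * (t2 - t1).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("drift bound must satisfy 0 <= rho < 1")
    return local_duration / (1.0 + rho), local_duration / (1.0 - rho)


def local_timeout_for(real_min: float, rho: float) -> float:
    """Local-clock timeout guaranteeing that at least real_min units of real time have elapsed."""
    return real_min * (1.0 + rho)


def certainly_timely(local_round_trip: float, rho: float, bound: float) -> bool:
    # A round trip measured on a single clock is known to be timely only if even
    # its worst-case real duration does not exceed the bound.
    return real_duration_bounds(local_round_trip, rho)[1] <= bound


rho = 1e-4  # hypothetical drift bound
print(certainly_timely(9.0, rho, 10.0))     # True
print(certainly_timely(9.9995, rho, 10.0))  # False: the worst case is about 10.0005
```

Measuring round trips on a single clock avoids any assumption about clock synchronisation; only the drift bound is needed.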

Other areas of active research concern fault-tolerance in large, complex distributed applications. Of special note in this area are techniques aimed at the coordinated handling of exceptions in environments where multiple concurrent threads of execution act on persistent data [Beder 2001] [Tartanoglu 2002]; fault-tolerance in peer-to-peer systems [Montresor 2002]; and mechanisms for dealing with errors that arise from architectural mismatches [de Lemos 2002].

The implementation of distributed fault-tolerance techniques is notoriously difficult and error-prone, especially when using off-the-shelf components (COTS) that typically (a) have ill-defined failure modes and (b) offer opaque interfaces that do not allow access to internal data without which fault-tolerance cannot be implemented. There is thus considerable interest in addressing these difficulties using wrapping technologies to improve the robustness of COTS components [Rodriguez 2000] and reflective technologies to allow introspection and intercession [Killijian 2000].
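A minimal sketch of the wrapping idea, assuming the COTS component can be treated as an opaque function: the wrapper filters erroneous inputs, checks the plausibility of outputs, and maps every anomaly onto a single well-defined failure mode (an exception). The wrapper, the exception type and the use of `math.sqrt` as a stand-in component are hypothetical illustrations, not the technique of [Rodriguez 2000].

```python
import math
from typing import Callable


class WrappedComponentError(Exception):
    """The single, well-defined failure mode exposed by the wrapper."""


def robustness_wrapper(
    component: Callable[[float], float],
    precondition: Callable[[float], bool],
    postcondition: Callable[[float, float], bool],
) -> Callable[[float], float]:
    def wrapped(x: float) -> float:
        # Input filtering: refuse erroneous inputs before they reach the opaque component.
        if not precondition(x):
            raise WrappedComponentError(f"rejected input {x!r}")
        try:
            y = component(x)
        except Exception as exc:  # arbitrary COTS failure modes are mapped onto one
            raise WrappedComponentError("component raised an exception") from exc
        # Output monitoring: detect erroneous results the component did not signal itself.
        if not postcondition(x, y):
            raise WrappedComponentError(f"implausible output {y!r} for input {x!r}")
        return y

    return wrapped


safe_sqrt = robustness_wrapper(
    math.sqrt,
    precondition=lambda x: isinstance(x, (int, float)) and math.isfinite(x) and x >= 0,
    postcondition=lambda x, y: math.isclose(y * y, x, rel_tol=1e-9, abs_tol=1e-12),
)

print(safe_sqrt(16.0))  # 4.0
```

Because the wrapper only needs the component's interface, it can be applied even when the internals are opaque, which is exactly the situation described for COTS components.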

In the mid-1980s, the European dependability community had the (then) outrageous idea that the fault tolerance paradigm could also be extended to address security issues through the notion of intrusion-tolerance [Fraga 1985] [Dobson 1986] [Deswarte 1991]. Such techniques are now receiving a justified revival of interest, as it is realized that intrusion prevention (through authentication, authorization, firewalls, etc.), like any other prevention technique, cannot offer absolute guarantees of security. Intrusion-tolerance is being addressed by the European MAFTIA project (see, e.g., [Deswarte 2001] [Correia 2002]) and, in the USA, is now the subject of a complete DARPA program (called OASIS, for Organically Assured & Survivable Information Systems), in which European researchers are also taking an active part through the DIT project.

Fault Removal

Fault removal, through verification techniques such as inspection, model-checking, theorem proving and testing, aims to reduce the number or the severity of faults. Current fault removal research within CaberNet is focused on software testing, especially with respect to faults (sometimes referred to as robustness testing), and on testing of fault-tolerance mechanisms (via fault injection).

One approach to software testing that has been investigated in depth at LAAS-CNRS is statistical testing, which is based on the notion of a test quality measured in terms of the coverage of structural or functional criteria. This notion of test quality enables the test criterion to be used to define a statistical test set, i.e., a probability distribution over the input domain, called a test profile, and the number of executions necessary to satisfy the quality objective. Recent research has focused on developing statistical test sets from UML state diagram descriptions of real-time object-oriented software [Chevalley 2001a] and on the assessment of test sets for object-oriented programs using mutation analysis [Chevalley 2001b].
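The relation between test quality and test size can be sketched as follows, assuming the quality is defined as q_N = 1 - (1 - p)^N, where p is the probability, under the test profile, of exercising the least likely element of the criterion in one execution and N is the number of executions. The element probabilities below are hypothetical.

```python
import math


def quality_after(n_executions: int, p_min: float) -> float:
    """Probability that the least likely criterion element is exercised at least once."""
    return 1.0 - (1.0 - p_min) ** n_executions


def required_executions(element_probs, target_quality: float) -> int:
    """Smallest N such that quality_after(N, min(element_probs)) >= target_quality."""
    p_min = min(element_probs)
    if not 0.0 < p_min <= 1.0:
        raise ValueError("every element of the criterion needs a positive probability under the profile")
    if not 0.0 < target_quality < 1.0:
        raise ValueError("target quality must lie strictly between 0 and 1")
    if p_min == 1.0:
        return 1
    return math.ceil(math.log(1.0 - target_quality) / math.log(1.0 - p_min))


# Hypothetical per-execution probabilities of exercising each element
# (e.g. each transition of a state diagram) under a candidate test profile.
probs = [0.5, 0.2, 0.01]
print(required_executions(probs, 0.9999))  # 917
```

The sketch makes the design lever visible: a profile that raises the probability of the least likely element reduces the number of executions needed for the same quality objective.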

Robustness testing, which is the focus of the AS23 project, aims to assess how well a (software) component protects itself against erroneous inputs. One approach to robustness testing that is being studied within the DSoS project is called "property-oriented testing". Here, the determination of test profiles is specifically aimed at verifying safety properties, typically of an embedded control system: heuristic search techniques are used to explore the input space (including both functional and non-functional inputs), attempting to push the system towards a violation of its required safety properties [Abdellatif 2001].
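The search-based idea behind property-oriented testing can be sketched with a greedy hill climber over a small input space. Everything here is a hypothetical toy: a tank controller whose safety property is that the level never exceeds 100, with the inflow as a functional input and the sensor delay as a non-functional one. The fitness is the remaining safety margin, and the search stops when it turns negative (a counterexample) or reaches a local optimum; the actual heuristics of [Abdellatif 2001] may differ.

```python
SAFE_LIMIT = 100  # hypothetical safety property: the tank level must never exceed 100


def simulate_tank(inflow: int, sensor_delay: int, steps: int = 40) -> int:
    """Hypothetical embedded controller under test; returns the peak level reached."""
    levels = [50]
    for t in range(steps):
        # The controller acts on a reading that is `sensor_delay` steps old.
        observed = levels[max(0, t - sensor_delay)]
        drain = 12 if observed > 80 else 0
        levels.append(max(0, levels[-1] + inflow - drain))
    return max(levels)


def safety_margin(point) -> int:
    # A negative margin means the safety property has been violated.
    return SAFE_LIMIT - simulate_tank(*point)


def property_oriented_search(start, bounds, max_iters: int = 100):
    """Greedy hill climbing that drives (inflow, sensor_delay) towards a violation."""
    current, current_margin = start, safety_margin(start)
    for _ in range(max_iters):
        if current_margin < 0:
            break  # counterexample found
        neighbours = [
            (current[0] + di, current[1] + dd)
            for di, dd in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if bounds[0][0] <= current[0] + di <= bounds[0][1]
            and bounds[1][0] <= current[1] + dd <= bounds[1][1]
        ]
        best = min(neighbours, key=safety_margin)
        if safety_margin(best) >= current_margin:
            break  # local optimum; a real tool would restart or diversify the search
        current, current_margin = best, safety_margin(best)
    return current, current_margin


best, margin = property_oriented_search(start=(5, 0), bounds=((0, 10), (0, 5)))
print(best, margin)
```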

Fault injection can either be used for robustness testing, in the sense defined above, or as a means for testing fault-tolerance mechanisms with respect to the specific inputs of such mechanisms, i.e., the faults they are supposed to tolerate [Buchacker 2001]. In fact, due to the measures of robustness or coverage that this technique allows, fault injection is also often a means of experimental evaluation (see fault forecasting below).

Finally, it is worth mentioning that some current research focuses on testing the use of the reflective technologies considered earlier as a means of simplifying the implementation of fault-tolerance [Ruiz-Garcia 2001].

Fault Forecasting

Fault forecasting is concerned with the estimation of the presence, the creation and the consequences of faults. This is a very active and prolific field of research within the dependability community. Both analytical and experimental evaluation techniques are considered.

Analytical evaluation of system dependability is based on a stochastic model of the system's behavior in the presence of fault and (possibly) repair events [Haverkort 2001]. For realistic systems, two major issues are (a) establishing a faithful and tractable model of the system's behavior [Fota 1999] [Kanoun 1999] [Betous-Almeida 2002] [Haverkort 2002], and (b) devising analysis procedures that allow the (possibly very large) model to be processed [Bell 2001]. Ideally, the analytical evaluation process should start as early as possible during development in order to make well-founded design decisions between alternative approaches (see [Bondavalli 2001] for an example of some recent research in this direction). Specific areas of research in analytical evaluation include systems with multiple phases of operation [Bondavalli 2000b] [Mura 2001] and large Internet-based applications requiring a hierarchical modeling approach [Kaâniche 2001].
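As a minimal example of analytical evaluation, the sketch below computes steady-state availability from a birth-death Markov chain, one of the simplest stochastic models of failure and repair events. It assumes constant failure and repair rates (hypothetical values) and a single repairman; realistic models are far larger and call for the dedicated procedures and tools cited above.

```python
def birth_death_steady_state(birth, death):
    """Steady-state probabilities of a birth-death CTMC with states 0..n.

    birth[i] is the rate from state i to i+1 (here: one more unit fails),
    death[i] is the rate from state i+1 back to i (here: one unit is repaired).
    """
    if len(birth) != len(death):
        raise ValueError("need one repair rate per failure transition")
    weights = [1.0]
    for b, d in zip(birth, death):
        # Balance across the cut between states i and i+1: pi[i] * b == pi[i+1] * d
        weights.append(weights[-1] * b / d)
    total = sum(weights)
    return [w / total for w in weights]


lam, mu = 1e-3, 0.1  # hypothetical failure and repair rates (per hour)

# Simplex: one repairable unit; its availability should equal mu / (lam + mu).
simplex = birth_death_steady_state([lam], [mu])

# Duplex with a single repairman: state = number of failed units (0, 1, 2);
# the system is up while at least one unit works.
duplex = birth_death_steady_state([2 * lam, lam], [mu, mu])
duplex_unavailability = duplex[2]

print(f"simplex availability {simplex[0]:.6f}, duplex unavailability {duplex_unavailability:.3e}")
```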

Experimental evaluation of system dependability relies on the collection of dependability data on real systems. The relevant data concern the times of, or between, dependability-relevant events such as failures and repairs; the data may be collected either during the test phase (see, e.g., [Littlewood 2000]) or during normal operation (see, e.g., [Simache 2001]). The observation of a system's behavior in the presence of faults can be accelerated by means of fault-injection techniques (see also fault removal above), which constitute a very popular subject of recent and ongoing research. Most of this work on fault injection is based on software-implemented fault injection (SWIFI), for which several tools have been developed (see, e.g., [Carreira 1998] [Fabre 1999] [Höxer 2002] [Rodriguez 2002]). The data obtained from fault injection experiments can be processed statistically to characterize the target system's failure behavior in terms of its failure modes [Marsden 2001] or to assess the effectiveness of fault-tolerance mechanisms in terms of coverage [Cukier 1999] [Aidemark 2002]. Research is currently under way, especially in the context of the DBench project, on using fault injection techniques to build dependability benchmarks for comparing competing systems or solutions on an equitable basis [Kanoun 2002].
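The statistical side of fault injection can be sketched as a tiny SWIFI-style campaign. Assumptions, all hypothetical: the target is a 32-bit word protected by a parity bit, each experiment flips one or more random bits, and coverage is estimated as the fraction of injected faults that the parity check detects, with a Wilson score interval expressing the statistical uncertainty. Every injection here produces an error; a real campaign must also account for faults that are never activated.

```python
import math
import random

WORD_BITS = 32


def parity(word: int) -> int:
    return bin(word).count("1") & 1


def inject_bit_flips(word: int, n_flips: int, rng: random.Random) -> int:
    # Software-implemented fault injection: flip n_flips distinct bits of the stored word.
    for pos in rng.sample(range(WORD_BITS), n_flips):
        word ^= 1 << pos
    return word


def run_campaign(n_experiments: int, max_flips: int, seed: int = 0):
    """Inject 1..max_flips bit flips per experiment; count those caught by a parity check."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_experiments):
        golden = rng.getrandbits(WORD_BITS)
        stored_parity = parity(golden)  # the error-detection mechanism under test
        faulty = inject_bit_flips(golden, rng.randint(1, max_flips), rng)
        if parity(faulty) != stored_parity:
            detected += 1
    return detected, n_experiments


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion (here: coverage)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


detected, n = run_campaign(1000, max_flips=2)
low, high = wilson_interval(detected, n)
print(f"coverage ~ {detected / n:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Parity detects every odd number of flipped bits and misses every even number, so the estimated coverage directly reflects the fault model chosen for the campaign.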

CaberNet Related Activities

AMSD (Accompanying Measure on System Dependability)
- University of Newcastle, UK
- LAAS - CNRS, France
- CNUCE-CNR, Italy
AMSD addresses the need for a coherent major initiative in FWP6 encompassing various aspects of dependability (reliability, safety, security, survivability, etc.); education and training; and means for encouraging and enabling sector-specific IST RTD projects to use dependability best practice. The results will be an overall dependability road-map that considers dependability in an adequately holistic way, and a detailed road-map for dependable embedded systems.

Analysis and Implementation of a System for Parameters Measuring and Operation Controlling of Automotive Diesel Engines

Universidad Politécnica de Valencia, Spain

The objective of this project is to design an open architecture for the control of a diesel engine in a benchmark laboratory.

AS23 (Advanced Testing Techniques for Complex Systems)
- LAAS - CNRS, France

This is a French national project launched by the STIC scientific department of CNRS. It addresses the robustness testing of systems with respect to erroneous or untimely inputs from their environment.

BAE SYSTEMS Systems Integration Consortium
- University of York, UK
- University of Newcastle, UK
The consortium involves a collaboration between the Universities of York, Newcastle and Loughborough and BAE SYSTEMS. The work at York is targeted at software productivity improvement. This is divided into the following areas:

CAUTION++ (Capacity and Network Management Platform for increased Utilisation of Wireless Systems of Next Generation++)

CNUCE and IEI (CNR, Pisa), Italy
University of Florence, Italy

In the framework of a cooperation with the Motorola Technology Center Italy, begun a couple of years ago, the group started research on dependability in wireless systems. This activity mainly centered on accurate availability analysis and, more generally, QoS analysis of GPRS systems [Tataranni 2001, Porcarelli 2002a, Porcarelli 2002b]. It will now continue in the framework of the European project IST-2001-38229 CAUTION++ (Capacity and network management platform for increased utilisation of wireless systems of next generation++), starting in November 2002. The main goal of this project is to design and develop a novel, low-cost, flexible, highly efficient and scalable system that mobile operators can use to increase the performance of all network segments.

Crystall (Correct Modular Group Communication Middleware)
- Ecole Polytechnique Fédérale de Lausanne, Switzerland

In this project we are interested in the design, verification and implementation of group communication using a modular approach, which is based on implementing the properties required by an application as separate protocols, and then combining selected protocols using a software framework.

DARP (Defence Aerospace Research Partnership on High Integrity Real-Time Systems)
- University of York, UK
The DARP builds on two existing research centres which have been extremely successful in technology transfer. BAE SYSTEMS has funded the Dependable Computing Systems Centre (DCSC) at York and Newcastle since 1991. The DCSC focuses on safety-critical real-time systems and has produced important research results. Rolls-Royce has funded the University Technology Centre (UTC) in Systems and Software Engineering since 1993. Thus the aim has been to define research activities which are valuable, complementary to existing programmes, and to ensure synergy so that the results of the DARP programme can be integrated with the other work, to meet the technical and commercial challenges:

DBench (Dependability Benchmarking)
- LAAS-CNRS, France
- Chalmers University of Technology, Sweden
- Critical Software, Portugal
- Universität Erlangen-Nürnberg, Germany
- Microsoft Research, Cambridge, UK
- Universidad Politécnica de Valencia, Spain

This is a European IST project (project IST-2000-25425). DBench aims to define a conceptual framework and an experimental environment for benchmarking the dependability of commercial off-the-shelf (COTS) components and COTS-based systems. It will provide system developers and end-users with means for characterising and evaluating the dependability of a component or a system, identifying malfunctioning or weakest parts that require more attention, tuning a particular component to enhance its dependability, and comparing the dependability of alternative or competing solutions.

DCSC (BAE SYSTEMS Dependable Computing Systems Centre)
- University of York, UK
- University of Newcastle, UK
The DCSC was established in 1991 at the Universities of York and Newcastle. This was the start of a long-term relationship between British Aerospace (as it was then known) and its academic partners. British Aerospace became part of BAE SYSTEMS in December 1999 and the commitment to the relationship continues into the new millennium. The research centre will achieve its mission through research and through technology transfer into the BAE SYSTEMS operating companies.

DEAR-COTS

Instituto Politécnico do Porto, Portugal

The main purpose of the DEAR-COTS project (funded by the Portuguese government, PRAXIS/P/EEI/14187/1998) was the specification of an architecture based on the use of commercial off-the-shelf (COTS) components, able to support distributed computer-controlled systems where reliability and timeliness are major requirements. The group's involvement in the project allowed it to contribute studies and protocols for fault-tolerance in CAN networks [Pinho 2001], and to provide a transparent framework for the replication of hard real-time applications [Pinho 2002].

DEEM (Dependability Modeling and Evaluation Tool for PMS)

CNUCE and IEI (CNR, Pisa), Italy
University of Florence, Italy

The project addresses the analytical dependability modeling of Phased Mission Systems (PMS), a class of systems whose operational life consists of a sequence of non-overlapping periods, called phases [Mura 2001]. Because such systems are deployed in critical applications, the dependability modeling and analysis of PMS is an issue of primary relevance. Our methodology, which exploits the power of the class of Markov regenerative stochastic Petri net models, yields an analytical solution with low computational complexity, dominated essentially by the cost of separately analyzing the system within each phase. This methodology is supported by the tool DEEM [Bondavalli 2000], a dependability modeling and evaluation tool for PMS, currently under development.
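To show what a phased mission computation involves, the brute-force sketch below assumes non-repairable components with phase-dependent constant failure rates and a monotone success criterion per phase, and propagates the probability distribution of the set of working components from one phase to the next. Its cost grows exponentially with the number of components, a cost that dedicated methods such as the one supported by DEEM are designed to keep low; it is not DEEM's algorithm, and the two-phase mission is hypothetical.

```python
import math
from itertools import product


def mission_reliability(components, phases):
    """Probability that a phased mission succeeds (non-repairable components).

    phases: list of (duration, failure_rates, success) where failure_rates maps a
    component to its constant failure rate in that phase, and success(up_set) is a
    monotone success criterion evaluated on the set of working components.
    """
    # Probability distribution over the set of components still working.
    dist = {frozenset(components): 1.0}
    for duration, rates, success in phases:
        nxt = {}
        for up, p in dist.items():
            up_list = sorted(up)
            survive = [math.exp(-rates.get(c, 0.0) * duration) for c in up_list]
            for outcome in product((True, False), repeat=len(up_list)):
                q = p
                for s, alive in zip(survive, outcome):
                    q *= s if alive else 1.0 - s
                after = frozenset(c for c, alive in zip(up_list, outcome) if alive)
                # Without repair and with a monotone criterion, checking at the end of
                # the phase suffices: the criterion then held throughout the phase.
                if success(after):
                    nxt[after] = nxt.get(after, 0.0) + q
        dist = nxt
    return sum(dist.values())


# Hypothetical two-phase mission: phase 1 needs both units, phase 2 needs either one.
phases = [
    (10.0, {"A": 1e-3, "B": 2e-3}, lambda up: {"A", "B"} <= up),
    (50.0, {"A": 5e-4, "B": 5e-4}, lambda up: bool(up & {"A", "B"})),
]
print(mission_reliability(["A", "B"], phases))
```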

DepAuDE (Dependability for Embedded Automation Systems in Dynamic Environments with Intra-site and Inter-site Distribution Aspects)

Katholieke Universiteit Leuven, Belgium

The project's goal is to develop a methodology and an architecture to improve dependability for non-safety critical, distributed, embedded automation systems with both IP (inter-site) and dedicated (intra-site) connections.

Design and Realization of Survivable Computer Systems and Networks

University of Hamburg, Germany

Survivable systems are known to be resistant to different kinds of problems. Among these are failures due to software or hardware faults, but also attacks by computer criminals. The design and implementation of survivable systems therefore requires a variety of different steps to support system analysis and synthesis. In this project, we elaborate a new approach to designing survivable systems (in particular computer and communication networks) based on a repeatedly applied analysis of the system to identify various kinds of threats, errors and performance bottlenecks. Our evaluation of a survivable system combines fault, performance and security management. In [Benecke 2002] the approach is applied, by way of example, to packet screens as important building blocks of firewalls. The project also places emphasis on the efficient solution of analytical reliability models and their application to communication networks [Heidtmann 2002].
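Analytical reliability models of communication networks are often expressed as two-terminal reliability: the probability that two nodes remain connected when links fail independently. The baseline below computes it exactly by enumerating all link states, assuming perfect nodes and hypothetical link reliabilities. It is exponential in the number of links, which is why efficient solution methods matter for real networks; it is not the method of [Heidtmann 2002].

```python
from itertools import product


def two_terminal_reliability(nodes, edges, source, target):
    """Exact source-target reliability by enumerating all 2^|E| link states.

    edges: list of (u, v, p_up) for independent, undirected links; nodes never fail.
    """
    total = 0.0
    for states in product((True, False), repeat=len(edges)):
        prob = 1.0
        adjacency = {n: set() for n in nodes}
        for (u, v, p_up), up in zip(edges, states):
            prob *= p_up if up else 1.0 - p_up
            if up:
                adjacency[u].add(v)
                adjacency[v].add(u)
        # Depth-first search over the working links only.
        seen, stack = {source}, [source]
        while stack:
            node = stack.pop()
            for nbr in adjacency[node] - seen:
                seen.add(nbr)
                stack.append(nbr)
        if target in seen:
            total += prob
    return total


# Hypothetical network: two disjoint two-hop routes from s to t, each link up with probability 0.9.
nodes = ["s", "a", "b", "t"]
edges = [("s", "a", 0.9), ("a", "t", 0.9), ("s", "b", 0.9), ("b", "t", 0.9)]
print(two_terminal_reliability(nodes, edges, "s", "t"))  # close to 1 - (1 - 0.81)**2 = 0.9639
```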

Development and Analysis of Fault Tolerant Distributed Applications Based on "Time Triggered Architecture" for Automotive Environment
- Universidad Politécnica de Valencia, Spain

The objective of this project is to design a brake-by-wire system based on a Time Triggered Architecture and a bridge for a light control subsystem based on a CAN network.

DIRC: Interdisciplinary Research Collaboration in Dependability of Computer-Based Systems
- University of Newcastle, UK
- City University London, UK
- University of Edinburgh, UK
- Lancaster University, UK
- University of York, UK

The EPSRC-funded collaboration addresses the dependability of computer-based systems. Dependability is a deliberately broad term to encompass many facets including reliability, security and availability. The term "computer-based systems" draws attention to the involvement of human participants in most complex systems. Because of the breadth of this view, the interdisciplinary approach will include sociologists and psychologists as well as computer scientists, statisticians etc. The six-year research funding will enable the collaboration to tackle broad and fundamental problems of creating dependable systems.

DISCS (Diversity In Safety Critical Software)
- City University, UK
- University of Newcastle, UK

The DISCS project tackles basic issues of interest to the users of design diversity: builders of fault-tolerant, safety-critical, software-based systems, their customers and the agencies responsible for the evaluation and licensing of such systems. The practical aim is better understanding to support better decision-making.
In the long run, better means of designing fault-tolerant systems will make these less expensive in production and will lessen the uncertainty about the fitness for purpose of the eventual product. Better means of evaluation will allow us to place greater confidence in the reliability and safety of systems, and thus better control the societal risk of critical systems.

The work at CSR at City University has focused on reliability modelling for diverse systems: we have extended previous models in various directions: modelling and assessment of a specific system rather than of an 'average' system, consideration of the fault insertion process and of the effects of project management decisions. The results affect product planning (what reliability gains can be expected from using design diversity), development (what project decisions can best achieve effective diversity) and assessment, acceptance and licensing (how to judge the reliability of a specific diverse system). In addition to the practical support for decision-making about diverse software-based systems, this modelling work improves our understanding of issues of diversity, reliability and common-mode failure in a wider context, with possible practical applications in the many other areas of engineering and organisational studies where these issues arise.
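The kind of question such models answer can be illustrated with one well-known earlier model of design diversity, due to Eckhardt and Lee. Assuming independently developed versions, if θ(x) is the probability that a randomly developed version fails on input x, a 1-out-of-2 system fails on a demand with probability E[θ²], which is never less than the naive independence estimate E[θ]². The two-class demand profile and difficulty values below are hypothetical, and the sketch does not reproduce the CSR models.

```python
def eckhardt_lee(profile, difficulty):
    """Failure probabilities of a single version and of a 1-out-of-2 pair (EL model).

    profile[x]:    probability that a demand falls in input class x
    difficulty[x]: probability that a randomly developed version fails on class x
    Returns (P(one version fails), P(both versions fail), naive independence estimate).
    """
    p_single = sum(profile[x] * difficulty[x] for x in profile)     # E[theta]
    p_both = sum(profile[x] * difficulty[x] ** 2 for x in profile)  # E[theta^2]
    return p_single, p_both, p_single ** 2


# Hypothetical: most demands are easy, a few are hard for every development team.
profile = {"easy": 0.9, "hard": 0.1}
difficulty = {"easy": 0.001, "hard": 0.1}
p_single, p_both, naive = eckhardt_lee(profile, difficulty)
print(f"single {p_single:.4f}, pair {p_both:.7f}, independence estimate {naive:.7f}")
```

Here the pair fails roughly eight times more often than independence would predict, because failures concentrate on the same "hard" demands: variation of difficulty across inputs induces positive correlation even between independently developed versions.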

In parallel, CSR at Newcastle have concentrated on structuring methods for diverse design. The DISCS project has also interacted with our DISPO project (with the University of Bristol), supporting the use of diversity for nuclear safety.

DISPO (DIverse Software PrOject)
- City University, UK
Focuses on the use of diversity in nuclear protection, and will improve the practical advice available to the developers and customers of protection systems using diverse redundancy. The objectives of this project are:
This project addresses the problem of obtaining claimable reliability benefits from the use of diverse software based systems. The beneficiaries of this research are nuclear regulators and utility companies.

DISPO-2 (DIverse Software PrOject)
- City University, UK
This project is a follow-up to the successful DISPO project (1997-2000). It builds on many years of successful research on software fault tolerance and diversity at CSR. From the viewpoint of safety assessment, we will study the practical application of the mathematical models we have previously produced for assessing failure correlation in diverse systems. From the viewpoint of achieving diversity for safety and reliability, we will advance the understanding of the effects of "diversity-seeking decisions".

Distributed Expert Systems for Process Monitoring and Controlling: Alcoholic Fermentation Application

Universidad Politécnica de Valencia, Spain

The objective of this project is to design a distributed architecture that uses a variation of expert systems called ruled-nets for the control of chemical process systems applied to alcoholic fermentation.

DIT (Dependable Intrusion Tolerance)
- LAAS - CNRS, France

The DIT project is part of the DARPA OASIS program (Organically Assured & Survivable Information Systems). The aim of the project is to develop Internet servers (in particular, Web servers) able to tolerate intrusions (in addition to accidental faults). The DIT architecture is based on diverse platforms (OS + application software) providing identical contents, under the control of diversified proxies. Error detection mechanisms (content comparison, integrity checks, mutual monitoring by proxies) are complemented by EMERALD intrusion detection tools. The redundancy level is automatically adapted according to the current alert level, with graceful performance degradation.
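The adjudication step of such an architecture can be sketched as a majority vote over diverse replicas whose number depends on the alert level. The mapping from alert level to replica count, the replica functions and the strict-majority rule are hypothetical assumptions; the actual DIT proxies, integrity checks and EMERALD integration are not modelled.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical mapping from the current alert level to the number of diverse replicas consulted.
REPLICAS_PER_ALERT_LEVEL = {"low": 1, "medium": 2, "high": 3}


def adjudicate(responses: List[bytes]) -> Optional[bytes]:
    """Return the content that a strict majority of consulted replicas agree on, else None."""
    if not responses:
        return None
    value, votes = Counter(responses).most_common(1)[0]
    # One replica: its answer is accepted unchecked. Two: both must agree (comparison).
    # Three: two matching answers outvote one corrupted replica.
    return value if votes > len(responses) // 2 else None


def serve(request: bytes, replicas: List[Callable[[bytes], bytes]], alert_level: str) -> Optional[bytes]:
    n = REPLICAS_PER_ALERT_LEVEL[alert_level]
    responses = [replica(request) for replica in replicas[:n]]
    return adjudicate(responses)  # None means disagreement was detected: withhold the reply


# Hypothetical diverse servers; the third has been compromised and serves defaced content.
replicas = [lambda r: b"<html>ok</html>", lambda r: b"<html>ok</html>", lambda r: b"<html>defaced</html>"]
print(serve(b"GET /", replicas, "high"))  # b'<html>ok</html>'
```

The trade-off mirrors the graceful degradation described above: a low alert level is cheaper but lets a single compromised replica through, whereas higher levels spend more resources to mask or at least detect it.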

DOTS (Diversity with Off-The-Shelf Components)
- City University, UK
- University of Newcastle, UK

The DOTS project unifies two strands of research in software engineering: design diversity for fault tolerance, and re-use of off-the-shelf software. It builds on previous work on diversity at the Centre for Software Reliability, and in particular on the DISCS project (Diversity In Safety Critical Software). It is motivated by the increasing industrial interest in using off-the-shelf (rather than bespoke) software for building new systems or applications. Its premises are:

In many applications, the main problem with off-the-shelf components is the difficulty of achieving confidence of sufficient reliability;

Software fault tolerance (diversity) is a convenient way of increasing system reliability without changing the internals of software modules;

Software fault-tolerance in the form of modular redundancy with diversity (as in "multiple-version software") becomes affordable and convenient when based on OTS items. This possibility has not been sufficiently studied.

Some methods for increasing the dependability of COTS-based systems (e.g. depending on wrappers with filtering or monitoring functions) are actually other examples of software fault tolerance, but have not been studied as such, e.g. to guide architectural decisions to achieve better reliability.

The general goal of this project is to support decisions both in the acceptance of a system including OTS items and in its development, i.e. in the choice and combination of OTS items, their interconnection and system-level verification.

DRAGON (Database Replication based on Group Communication Primitives)
- Ecole Polytechnique Fédérale de Lausanne, Switzerland

This project aims at designing and implementing a tool to support replication in distributed databases using distributed system concepts (group communication technology), and ensuring replica consistency with good performance.
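One classic way of keeping replicas consistent with group communication is to deliver updates in a total order, for example through a fixed sequencer, and to apply them deterministically. The sketch below assumes reliable (eventually delivered) messages and a sequencer that never crashes; the class names are hypothetical, and it is not the DRAGON protocol.

```python
import random
from dataclasses import dataclass, field


class Sequencer:
    """Fixed sequencer: stamps each transaction with a global sequence number."""

    def __init__(self) -> None:
        self.next_seq = 0

    def order(self, txn):
        msg = (self.next_seq, txn)
        self.next_seq += 1
        return msg


@dataclass
class Replica:
    state: dict = field(default_factory=dict)
    next_expected: int = 0
    pending: dict = field(default_factory=dict)

    def receive(self, msg) -> None:
        seq, txn = msg
        self.pending[seq] = txn
        # Deliver strictly in sequence-number order, holding back messages after a gap.
        while self.next_expected in self.pending:
            key, value = self.pending.pop(self.next_expected)
            self.state[key] = value  # deterministic update, so replicas stay identical
            self.next_expected += 1


sequencer = Sequencer()
messages = [sequencer.order(("x", i)) for i in range(5)] + [sequencer.order(("y", 1))]

in_order, shuffled = Replica(), Replica()
for m in messages:
    in_order.receive(m)
reordered = messages[:]
random.Random(1).shuffle(reordered)  # the network delivers in a different order
for m in reordered:
    shuffled.receive(m)
print(in_order.state == shuffled.state)  # True
```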

DSoS (Dependable Systems of Systems)
- INRIA - Rocquencourt, France
- LAAS - CNRS, France
- University of Newcastle, UK
- Technische Universität Wien, Austria
DSoS is a European IST project (IST-1999-11585) that aims to develop significantly improved means for composing a dependable "system of systems" from a set of largely autonomous component computer systems. The project focuses on the design (type, placement, properties) of the interfaces that form the common boundaries between component systems, and the associated validation and dependability assessment activities.

FIT (Fault Injection into Time Triggered Architecture)
- Technische Universität Wien, Austria
- Chalmers University of Technology, Göteborg, Sweden
- Universidad Politécnica de Valencia, Spain

The goal of the project is to validate the communication controller of a Time Triggered Architecture by means of several fault injection techniques.

High-Security Real Time Distributed System: Mobile Robot Control Application
- Universidad Politécnica de Valencia, Spain
- Universitat Politècnica de Catalunya, Spain

The objective of this project is to develop a distributed fault-tolerant architecture for mobile robot control. This architecture uses vision, wireless and motion control subsystems interconnected by a fibre-optic Controller Area Network (CAN).

Jgroup
- Università di Bologna, Italy

Jgroup is an integration of group technology with distributed objects. Jgroup supports a programming paradigm called object groups that enables development of reliable and highly-available services based on replication.

MAFTIA (Malicious- and Accidental-Fault Tolerance for Internet Applications)
- LAAS - CNRS, France
- Universidade de Lisboa, Portugal
- University of Newcastle, UK
MAFTIA is a European IST project (IST-1999-11583) aimed at investigating the tolerance paradigm in security. Instead of just aiming to prevent intrusions, the aim is to make the overall system secure and operational, even if some subsystems are successfully attacked.

Micro-controller Based Control System for a Laser: Image Capturing Application of the Injection in an Automotive Diesel Engine
- Universidad Politécnica de Valencia, Spain

The objective of this project is to develop microcontroller- and FPGA-based systems to control a laser system and a vision system that take pictures of the injection process inside a cylinder of a diesel engine.

MP6 (Security Models and Policies for Healthcare and Social Information and Communication Systems)
- LAAS - CNRS, France
MP6 is a project of the French national RNRT research network. The project aims to analyze security requirements for information systems in healthcare and social sectors, and to develop security policies adapted to these requirements, supported by models able to verify certain properties. Authorization and anonymization problems are of particular interest, and two policy examples will be developed for these two cases.

Neko (A Single Environment to Simulate and Prototype Distributed Algorithms)
- Ecole Polytechnique Fédérale de Lausanne, Switzerland

The goal of the project is to build a highly extensible yet simple and easy to use Java framework for constructing and testing reliable distributed algorithms.

PETERS (Pre-Exploitative Tools for Evaluating Reliability of Software)
- City University, UK

This project is concerned with the development of general techniques for obtaining accurate measures and predictions of the reliability of software. It builds on very successful, and novel, research within the Centre for Software Reliability that now allows certain reliability measures to be accompanied by a guarantee of accuracy, and in very general circumstances allows the reliability predictions from models to be improved in the light of their previous errors.
The work here mainly addresses the problems faced by statistically unsophisticated users of these new advanced statistical approaches: it provides means whereby the power of the techniques can be made accessible to industrial reliability engineers and software engineers.

PRIDE
- CNUCE and IEI (CNR, Pisa), Italy
- University of Florence, Italy
The group is currently also active in the PRIDE project, funded by the Italian national space agency (ASI), which deals with automatic transformations from system designs in UML to dependability models targeting the most common analysis tools. In addition, research on architectural design and analysis of control systems for embedded real-time applications in the railway field is in progress, in cooperation with the most relevant Italian railway company (Ansaldo Segnalamento Ferroviario).

Ravenspark

University of York, UK

This is an EPSRC/DERA (now QinetiQ) funded project that also involves Praxis and BAE SYSTEMS. The project is concerned with formal analysis of Ravenscar programs (Ravenscar is a simple subset of Ada95 tasking features) using model checking, and with linking formalised subsets of the sequential parts of Ada (such as SPARK) with Ravenscar.

RIS (Dependability Engineering Network)
- LAAS - CNRS, France

RIS is a cooperative academia-industry network managed by LAAS that aims to share past experience and to stimulate joint working groups on themes dealing with dependability engineering of software-intensive systems.

SHIMA (Integrated Modular Avionics for Small Helicopters)
- University of York, UK

This project is funded by the DTI, and its partners are Stewart Hughes, Smiths Industries and the University of York. Its overall goal is to investigate whether Integrated Modular Avionics (IMA) technology, which was defined for large-scale avionics systems, can be applied in a small helicopter environment. The specific aims are twofold. The first is to design and prototype a high-integrity, APEX-compliant operating system kernel that uses the Ravenscar Profile of Ada 95 to support temporal and spatial fire-walling. The second is to demonstrate the prototype with a mixed-language (Ada and C) application containing both safety-critical and non-critical components.
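Temporal fire-walling in IMA is commonly achieved with a static cyclic schedule. A repeating major frame is divided into fixed time windows, and each window is owned by one partition, so a misbehaving partition cannot consume another partition's processor time. The sketch below checks such a schedule. It assumes an ARINC 653-style window table, and the partition names and timings are hypothetical rather than taken from the SHIMA kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """One time window in a hypothetical static partition schedule."""
    partition: str
    offset_ms: int    # start time relative to the beginning of the major frame
    duration_ms: int


def check_schedule(windows: list[Window], major_frame_ms: int) -> list[str]:
    """Return the problems that would break temporal fire-walling:
    invalid windows, windows running past the major frame, and overlaps."""
    problems = []
    ordered = sorted(windows, key=lambda w: w.offset_ms)
    for w in ordered:
        if w.offset_ms < 0 or w.duration_ms <= 0:
            problems.append(f"{w.partition}: invalid window")
        elif w.offset_ms + w.duration_ms > major_frame_ms:
            problems.append(f"{w.partition}: window exceeds major frame")
    # After sorting by start time, any overlap shows up between neighbours.
    for a, b in zip(ordered, ordered[1:]):
        if a.offset_ms + a.duration_ms > b.offset_ms:
            problems.append(f"{a.partition} overlaps {b.partition}")
    return problems


def budget_ms(windows: list[Window], partition: str) -> int:
    """Processor time guaranteed to one partition in every major frame."""
    return sum(w.duration_ms for w in windows if w.partition == partition)


# Hypothetical 100 ms major frame mixing critical and non-critical partitions.
schedule = [
    Window("flight_control", 0, 20),
    Window("health_monitor", 20, 10),
    Window("flight_control", 50, 20),
    Window("maintenance_log", 70, 30),
]
print(check_schedule(schedule, 100))          # prints []
print(budget_ms(schedule, "flight_control"))  # prints 40
```

Because the windows are fixed offline, a partition's time budget is known before the system runs. A checker of this kind is therefore one plausible place to enforce the separation between safety-critical and non-critical components.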

PURTA (Precise UML for Real-Time Applications)
- University of York, UK

Tata Consultancy Services is funding a three-year project called PURTA (Precise UML for Real-Time Applications). The project is developing a precise semantic framework for UML that will allow it to be applied to the specification of high-integrity real-time systems.

UTC (Rolls-Royce University Technology Centre for Systems and Software Engineering)
- University of York, UK

In October 1993, Rolls-Royce established a University Technology Centre (UTC) in Systems and Software Engineering, and it also funds an associated project known as ASSET. The work is particularly concerned with the production of electronic engine controllers (EECs) for large civil and military aircraft engines. Current work covers requirements analysis, reuse of specifications and designs, timing and schedulability analysis, safety cases and metrics. ASSET is concerned with the rapid and cost-effective development of EEC software, and is producing prototype tools to support this process.

Validation of Graphically Elicited Multi-variate Probability Models for Safety Assessment of Computer-based Systems
- City University, UK

As with other non-trivial software-based systems, we must assume that a safety-critical system can fail. For each system, we must therefore try to establish whether failure is sufficiently unlikely for the system to be licensed for operational use. Several special features of complex, software-based, safety-critical systems make their dependability difficult to assess. In particular, there is insufficient objective statistical evidence, whether from testing or from the actual operation of related systems in related environments, to show that a new system has achieved the required level of reliability. For this reason, safety assessors turn to other sources of evidence to increase their assurance that such a system is fit for purpose. These sources might include measurable evidence about the quality of the requirements elicitation and design processes, or about the competence of development personnel. They might also include more subjective expert assessment of these and other factors. For much of this evidence, there will not always be scientifically accepted or widely agreed relationships and causal models on which to rely.

This project investigates the contribution that graphical probability models, or "belief networks", might make to these problems. In particular, it focuses on:

- Examining and comparing the value of alternative graphical formalisms such as Directed Acyclic Graphs, Undirected Graphs and Chain Graphs. We hope that using these multiple formalisms will increase our assurance that the builders and users of such graphical models understand the "system of conditional independence assumptions" depicted by the graphs as fully and correctly as possible.
- Developing automated methods that provide multiple forms of feedback on the structure, assumptions and consequences of such graphical probability models. We intend this feedback to include automatically computed symbolic (as well as numeric) forms of model output. Our aim is to deepen a safety assessment expert's appreciation of, and interaction with, the formal conditional-independence model expressed by a graphical formalism. This should give experts confidence that such models, ultimately and after appropriate adjustment, can become a valuable aid to, and a fair representation of, their coherent beliefs.
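To make the "system of conditional independence assumptions" concrete, here is a minimal sketch of a three-node belief network (a Directed Acyclic Graph). The variables are hypothetical and the probabilities are invented for illustration; this is not a model elicited in the project. The DAG determines how the joint distribution factorises, and that factorisation is what encodes the independence assumptions an expert would need to check.

```python
from itertools import product

# Hypothetical DAG  Q -> F -> T  (all variables Boolean):
#   Q: the development process was of good quality
#   F: the system contains a safety-relevant fault
#   T: the system passes its acceptance tests
# The probabilities are invented for illustration only.
P_Q = 0.7                                 # P(Q=True)
P_F_GIVEN_Q = {True: 0.1, False: 0.4}     # P(F=True | Q)
P_T_GIVEN_F = {True: 0.3, False: 0.95}    # P(T=True | F)


def bern(p_true: float, value: bool) -> float:
    """Probability of `value` for a Boolean variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(q: bool, f: bool, t: bool) -> float:
    """Factorisation implied by the DAG: P(Q, F, T) = P(Q) P(F | Q) P(T | F).
    T has no edge from Q, which encodes 'T is independent of Q given F'."""
    return bern(P_Q, q) * bern(P_F_GIVEN_Q[q], f) * bern(P_T_GIVEN_F[f], t)


def p_fault_given_pass() -> float:
    """P(F=True | T=True) by enumeration, summing out the unobserved Q."""
    num = sum(joint(q, True, True) for q in (True, False))
    den = sum(joint(q, f, True) for q, f in product((True, False), repeat=2))
    return num / den


# Check the encoded independence: P(T=True | F=True, Q=q) is 0.3 for both q.
for q in (True, False):
    p = joint(q, True, True) / (joint(q, True, True) + joint(q, True, False))
    print(q, round(p, 3))

print(round(p_fault_given_pass(), 3))  # prints 0.069
```

With these made-up numbers, the prior probability of a fault is 0.7 × 0.1 + 0.3 × 0.4 = 0.19, and passing the tests lowers it to 2/29 ≈ 0.069. Removing or adding a single edge would change both the factorisation and the conclusions. This is why the project stresses that modellers and experts must understand exactly which independences a graph asserts.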

VOSS (Validation of Stochastic Systems)
- RWTH Aachen, Germany
This is a cooperation project between three German universities (Aachen, Bonn, Erlangen) and two Dutch universities (Nijmegen, Twente) on the modelling and validation of stochastic systems, including dependable systems. The project is financed by the Dutch and German science foundations (NWO and DFG). Its results are also applicable to networking and real-time systems.

References

[Abdellatif 2001] O. Abdellatif-Kaddour, P. Thévenod-Fosse, H. Waeselynck, Adaptation of simulated annealing to property-oriented testing for sequential problems, 2001 International Conference on Dependable Systems and Networks (DSN'2001). Fast abstracts, Göteborg (Sweden), 1-4 July 2001, pp.B82-B83

[Aidemark 2002] J. Aidemark, J. Vinter, P. Folkesson, J. Karlsson, Experimental Evaluation of Time-redundant Execution for a Brake-by-wire Application, International Conference on Dependable Systems and Networks (DSN-2002), Washington DC, USA, June 2002.

[Arlat 1999] J. Arlat, Y. Crouzet, Y. Deswarte, J.C. Laprie, D. Powell, P. David, J.L. Déga, C. Rabéjac, H. Schindler, J.F. Soucaille, Fault tolerant computing, Encyclopedia of Electrical and Electronic Engineering, Vol.7, Ed. J.G. Webster, Wiley Interscience, ISBN 0471139467, 1999, pp.285-313

[Avizienis 2001] A. Avizienis, J.-C. Laprie and B. Randell, Fundamental Concepts of Dependability. Technical Report 739, pp. 1-21, Department of Computing Science, University of Newcastle upon Tyne, 2001.

[Beder 2001] D.M. Beder, B. Randell, A. Romanovsky and C.M.F. Rubira-Calsavara, On Applying Coordinated Atomic Actions and Dependable Software Architectures for Developing Complex Systems, International Symposium on Object-oriented Real-time Distributed Computing, Magdeburg, Germany, May 2001, IEEE, 4, pp. 103-112, 2001.

[Bell 2001] A. Bell, B.R. Haverkort, Serial and parallel out-of-core solution of linear systems arising from generalised stochastic Petri net models, High Performance Computing 2001, Seattle, USA, 22-26 April 2001

[Benecke 2002] C. Benecke, "Überlebensfähige Sicherheitskomponenten für Hochgeschwindigkeitsnetze: Entwurf und Realisierung am Beispiel einer Packet Screen" [Survivable security components for high-speed networks: design and realization using the example of a packet screen], PhD dissertation, Department of Computer Science, University of Hamburg. Berichte aus dem Forschungsschwerpunkt Telekommunikation und Rechnernetze, Vol. 3, B.E. Wolfinger (ed.), Shaker-Verlag, Aachen, Germany, 2002.

[Betous-Almeida 2002] C. Betous-Almeida, K. Kanoun, Stepwise construction and refinement of dependability models, 2002 International Conference on Dependable Systems & Networks (DSN'2002), Washington (USA), 23-26 June 2002, pp.515-526

[Bondavalli 2000a] Andrea Bondavalli, Silvano Chiaradonna, Felicita Di Giandomenico, Fabrizio Grandoni, Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults, IEEE Transactions on Computers, 49(3) March 2000, pp. 230-245

[Bondavalli 2000b] A. Bondavalli, I. Mura, S. Chiaradonna, R. Filippini, S. Poli, and F. Sandrini, DEEM: a Tool for the Dependability Modeling and Evaluation of Multiple Phased Systems, International Conference on Dependable Systems and Networks (DSN2000), New York, NY, USA, IEEE Computer Society Press. June 2000, pp. 231-236.

[Bondavalli 2001] A. Bondavalli, M. Dal Cin, D. Latella, I. Majzik, A. Pataricza and G. Savoia, Dependability Analysis in the Early Phases of UML Based System Design, Journal of Computer Systems Science and Engineering, Vol. 16, pp. 265-275, 2001.

[Buchacker 2001] K. Buchacker, V. Sieh, Framework for Testing the Fault-Tolerance of Systems Including OS and Network Aspects, High-Assurance System Engineering Symposium (HASE 2001), IEEE, Boca Raton, Florida, 2001, pp. 95-105.

[Carreira 1998] Joao Carreira, Henrique Madeira, João Gabriel Silva, Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers, IEEE Transactions on Software Engineering, 24(2): 125-136 (1998)

[Casimiro 2002] António Casimiro, Paulo Veríssimo, Generic Timing Fault Tolerance using a Timely Computing Base, International Conference on Dependable Systems and Networks (DSN 2002), Washington D.C., USA, June 2002

[Chevalley 2001a] P. Chevalley, P. Thévenod-Fosse, An empirical evaluation of statistical testing designed from UML state diagrams: the flight guidance system case study, 12th International Symposium on Software Reliability Engineering (ISSRE'2001), Hong Kong, 27-30 November 2001, pp.254-263

[Chevalley 2001b] P. Chevalley, Applying mutation analysis for object-oriented programs using a reflective approach, 8th Asia-Pacific Software Engineering Conference (APSEC 2001), Macau (China), 4-7 December 2001, pp.267-270

[Correia 2002] Miguel Correia, Paulo Veríssimo, Nuno Ferreira Neves, The Design of a COTS Real-Time Distributed Security Kernel, 4th European Dependable Computing Conference, Toulouse, France, October 2002

[Cukier 1999] M. Cukier, D. Powell, J. Arlat, Coverage estimation methods for stratified fault-injection, IEEE Transactions on Computers, Vol.48, No.7, pp.707-723, July 1999

[de Lemos 2002] R. de Lemos, C. Gacek, A. Romanovsky, Tolerating Architectural Mismatches, ICSE Workshop on Architecting Dependable Systems, May 2002, Orlando, FL, USA.

[Dearden 2000] A. Dearden, M. Harrison and P. Wright, "Allocation of Function: Scenarios, Context and the Economics of Effort", International Journal of Human-Computer Studies, 52, pp.289-318, 2000.

[Deswarte 1991] Y. Deswarte, L. Blain and J.-C. Fabre, "Intrusion Tolerance in Distributed Systems", in Symp on Research in Security and Privacy, (Oakland, CA, USA), pp.110-21, IEEE Computer Society Press, 1991.

[Deswarte 2001] Y. Deswarte, N. Abghour, V. Nicomette, D. Powell, An Internet authorization scheme using smart card-based security kernels, International Conference on Research in Smart Cards (E-smart 2001), Cannes (France), 19-21 September 2001

[Dobson 1986] J. E. Dobson and B. Randell, "Building Reliable Secure Systems out of Unreliable Insecure Components", in Conf on Security and Privacy, (Oakland, CA, USA), pp.187-93, IEEE Computer Society Press, 1986.

[Elmenreich 2002] Wilfried Elmenreich, Philipp Peti, Achieving Dependability in Time-Triggered Networks by Sensor Fusion, 6th IEEE International Conference on Intelligent Engineering Systems (INES), May 2002, Opatija, Croatia

[Essamé 1999] D. Essamé, J. Arlat, D. Powell, PADRE: A Protocol for Asymmetric Duplex Redundancy, 7th IFIP International Working Conference on Dependable Computing for Critical Applications (DCCA-7), San Jose (USA), 6-8 January 1999, pp.213-232

[Fabre 1999] J.C. Fabre, F. Salles, M. Rodriguez, J. Arlat, Assessment of COTS microkernels by fault injection, 7th IFIP International Working Conference on Dependable Computing for Critical Applications (DCCA-7), San Jose (USA), 6-8 January 1999, pp.19-38

[Fota 1999] N. Fota, M. Kaâniche, K. Kanoun, Incremental approach for building stochastic Petri nets for dependability modeling, Statistical and Probabilistic Models in Reliability, Eds. D.S. Ionescu, N. Limnios, ISBN 0-8176-4068-1, Birkhäuser, 1999, pp. 321-335

[Fraga 1985] J. Fraga and D. Powell, "A Fault and Intrusion-Tolerant File System", in IFIP 3rd Int Conf on Computer Security, (J. B. Grimson and H.-J. Kugler, Eds.), (Dublin, Ireland), Computer Security, pp.203-18, Elsevier Science Publishers B.V. (North-Holland), 1985.

[Haverkort 2001] B.R. Haverkort, R. Harper, Performance and Dependability Modelling Techniques and Tools, special issue of Performance Evaluation, Volume 44, Issues 1-4, 2001

[Haverkort 2002] B.R. Haverkort, L. Cloth, H. Hermanns, J.P. Katoen, C. Baier, Model checking performability properties, IEEE Int'l Conference on Dependable Systems and Networks, June 2002, pp.103-112

[Heidtmann 2002] K. Heidtmann, "Statistical Comparison of Two Sum-of-Disjoint-Product Algorithms for Reliability and Safety Evaluation". Proceedings of the International Conference on Computer Safety, Reliability and Security (SAFECOMP 2002). Catania, Italy. LNCS. Springer. Berlin, Germany. September 2002.

[Höxer 2002] H.-J. Höxer, V. Sieh, K. Buchacker, UMLinux - A Tool for Testing a Linux System's Fault Tolerance, LinuxTag 2002, Karlsruhe, Germany, 6-9 June 2002.

[Kaâniche 2001] M. Kaâniche, K. Kanoun, M. Rabah, A framework for modeling availability of e-business systems, 10th IEEE International Conference on Computer Communications and Networks (IC3N'2001), Scottsdale (USA), 15-17 October 2001, pp.40-45

[Kanoun 1999] K. Kanoun, M. Borrel, T. Morteveille, A. Peytavin, Availability of CAUTRA, a subset of the French air traffic control system, IEEE Transactions on Computers, Vol.48, No.5, pp.528-535, May 1999

[Kanoun 2002] K. Kanoun, H. Madeira, J. Arlat, A preliminary framework for dependability benchmarking, Workshop on Dependability Benchmarking, Washington (USA), Supplement to 2002 International Conference on Dependable Systems and Networks (DSN 2002), 23-26 June 2002, pp.F.7-F.8

[Killijian 2000] M.O. Killijian, J.C. Fabre, Implementing a reflective fault-tolerant CORBA system, 19th IEEE Symposium on Reliable Distributed Systems (SRDS 2000), Nuremberg (Germany), 16-18 October 2000, pp.154-163

[Littlewood 2000] B. Littlewood, P. Popov, L.Strigini, Assessment of the Reliability of Fault-Tolerant Software: a Bayesian Approach, 19th International Conference on Computer Safety, Reliability and Security (SAFECOMP'2000), Rotterdam, the Netherlands, Springer, 2000

[Marsden 2001] E. Marsden, J.C. Fabre, Failure mode analysis of CORBA service implementations, IFIP/ACM International Conference on Distributed Systems Platforms, Heidelberg (Germany), 12-16 November 2001

[Mersiol 2002] M. Mersiol, J. Arlat, D. Powell, A. Saidane, H. Waeselynck, C. Mazet, FAST: a prototype tool for supporting the engineering of socio-technical systems, 3rd European Systems Engineering Conference, Toulouse (France), 21-24 May 2002, pp.33-40

[Montresor 2002] A. Montresor, H. Meling, O. Babaoglu, Towards Adaptive, Resilient and Self-Organizing Peer-to-Peer Systems. Proceedings of 1st International Workshop on Peer-to-Peer Computing, Pisa, Italy, May 2002

[Mostéfaoui 2001] A. Mostéfaoui, S. Rajsbaum, M. Raynal, Conditions on input vectors for consensus solvability in asynchronous distributed systems, 33rd Annual ACM Symposium on Theory of Computing, July 6-8, 2001, Heraklion, Crete, Greece. ACM, 200

[Mura 2001] I. Mura, A. Bondavalli, Markov Regenerative Stochastic Petri Nets to Model and Evaluate the Dependability of Phased Missions, IEEE Transactions on Computers, 50 (12) December 2001, pp.1337-1351

[Pinho 2001] L. Pinho, F. Vasques. Timing Analysis of Reliable Real-Time Communication in CAN Networks, 13th Euromicro Conference on Real-Time Systems, Delft, Netherlands. June 2001. pp. 103-112.

[Pinho 2002] L. Pinho, F. Vasques. Transparent Environment for Replicated Ravenscar Applications, 7th International Conference on Reliable Software Technologies - Ada-Europe 2002, Vienna, Austria. June 2002.

[Porcarelli 2002a] S. Porcarelli, F. Di Giandomenico, and A. Bondavalli. "Analyzing Quality of Service of GPRS Network Systems from a User's Perspective". Proceedings of the IEEE Symposium on Computers and Communications (ISCC 02) (to appear). Taormina, Italy. July 2002.

[Porcarelli 2002b] S. Porcarelli and F. Di Giandomenico. "On the Effects of Outages on the QoS of GPRS Networks under different User Characterizations". Proceedings of the 4th European Dependable Computing Conference (EDCC-4) (to appear). Toulouse, France. October 2002.

[Powell 1999] D. Powell, J. Arlat, L. Beus-Dukic, A. Bondavalli, P. Coppola, A. Fantechi, E. Jenn, C. Rabejac, A. Wellings, GUARDS: a generic upgradable architecture for real-time dependable systems, IEEE Transactions on Parallel and Distributed Systems, Vol.10, No.6, pp.580-599, June 1999

[Powell 2001] D. Powell (Ed.), A Generic Fault-Tolerant Architecture for Real-Time Dependable Systems, Kluwer Academic Publishers, ISBN 0-7923-7295-6, 2001, 242p.

[Powell 2002] D. Powell, "Canarsie Line, the 'French Touch' under Broadway - Safety of the New York subway", CNRS Info, no. 401, pp.17-8, 2002 (in French).

[Rodriguez 2000] M. Rodriguez, J.C. Fabre, J. Arlat, Formal specification for building robust real-time microkernels, 21st IEEE Real-Time Systems Symposium (RTSS2000), Orlando (USA), 27-30 November 2000, pp.119-128

[Rodriguez 2002] M. Rodriguez, A. Albinet, J. Arlat, MAFALDA-RT: a tool for dependability assessment of real-time systems, 2002 International Conference on Dependable Systems & Networks (DSN'2002), Washington (USA), 23-26 June 2002, pp.267-272

[Ruiz-Garcia 2001] J.C. Ruiz-Garcia, P. Thévenod-Fosse, J.C. Fabre, A strategy for testing MetaObject Protocols in reflective architectures, 2001 International Conference on Dependable Systems and Networks (DSN'2001), Göteborg (Sweden), 1-4 July 2001, pp.327-336

[Simache 2001] C. Simache, M. Kaâniche, Measurement-based availability analysis of Unix systems in a distributed environment, 12th International Symposium on Software Reliability Engineering (ISSRE'2001), Hong Kong, 27-30 November 2001, pp.346-355

[Steiner 2002] Wilfried Steiner, Michael Paulitsch, The Transition from Asynchronous to Synchronous System Operation: An Approach for Distributed Fault-Tolerant Systems, International Conference on Distributed Computing Systems (ICDCS 2002), Vienna, Austria, July 2-5, 2002.

[Tartanoglu 2002] Ferda Tartanoglu, Valérie Issarny, Alexander Romanovsky, Nicole Levy, Dependability in the Web Service Architecture, ICSE Workshop on Architecting Dependable Systems, May 2002, Orlando, FL, USA.

[Tataranni 2001] F. Tataranni, S. Porcarelli, F. Di Giandomenico, and A. Bondavalli. "Analysis of the Effects of Outages on the Quality of Service of GPRS Network Systems". Proceedings of the International Conference on Dependable Systems and Networks (DSN2001). Göteborg, Sweden. IEEE Computer Society Press. July 2001. pp.235-244.

[Veríssimo 2002] Paulo Veríssimo, António Casimiro, The Timely Computing Base Model and Architecture, IEEE Transactions on Computers, Special Issue on Asynchronous Real-Time Systems, vol. 51, n. 8, Aug 2002

[Zarras 2001] Apostolos Zarras, Valérie Issarny, Automating the Performance and Reliability Analysis of Enterprise Information Systems, 16th IEEE International Conference on Automated Software Engineering (ASE2001), pages 350-354, November, 2001, San Diego CA, USA. Lecture Notes in Computer Science 2218, Middleware 2001, Springer, ISBN 3-540-42800-3, 2001, pp.216-231

Maintained by Rogério de Lemos (r.delemos@ukc.ac.uk)
Last updated 4 November, 2002
