The key to fault prevention is rigorous design, i.e., the adoption of design practices that eliminate particular classes of error. For example, by adopting appropriate design practices one can avoid deadlock in distributed systems. The term “formal methods” refers to the use of completely mathematical specifications, together with machine-checked proofs that designs and code satisfy those specifications. Although there is a long history of such research, complete formality remains most appropriate for safety-critical applications; given today's tools, it is not cost-effective for most other applications. Rigorous methods are those “informed by formality”. Often the formality informing a rigorous method takes the form of a mathematical model that is adequate to justify the soundness of the method.
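One classic instance of such a practice, sketched below in Python purely for illustration (the class and function names are hypothetical), is to impose a fixed global acquisition order on shared resources: this rules out circular wait, one of the necessary conditions for deadlock, so deadlock is prevented by construction rather than detected at run time.

```python
import threading

# Illustrative sketch: acquiring locks in a fixed global order rules
# out circular wait, so deadlock cannot occur by construction.

class OrderedLock:
    _counter = 0

    def __init__(self):
        # Each lock receives a unique rank at creation time.
        OrderedLock._counter += 1
        self.rank = OrderedLock._counter
        self._lock = threading.Lock()

def acquire_all(locks):
    """Acquire a set of locks in ascending rank order."""
    for lock in sorted(locks, key=lambda l: l.rank):
        lock._lock.acquire()

def release_all(locks):
    for lock in sorted(locks, key=lambda l: l.rank, reverse=True):
        lock._lock.release()

a, b = OrderedLock(), OrderedLock()

def worker():
    # Every thread acquires {a, b} through acquire_all, so all threads
    # lock in the same order and can never wait on each other cyclically.
    acquire_all([a, b])
    try:
        pass  # ... critical section ...
    finally:
        release_all([a, b])

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```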
Areas that can be identified as future directions of research into rigorous design methods include:
Model Driven Development: There is a need for formal models that better capture the features of systems that are key to dependability. For example, openness demands an approach to modelling trust, confidence and knowledge in the presence of malicious parties; mobility demands models of resource consumption, real time and real space; adaptability demands models of evolving operating environments and of the emergent properties of massively distributed systems; workability is concerned with modelling human activity, system configuration and the ability of an operator to observe the system. All of these more or less formal models inspire approaches to fault prevention (and fault removal) that target particular classes of fault.
Influencing Current Design Practice: Rigorous design is a pragmatic activity; its primary aim is to package formal insights in a way that makes them applicable in real-world design contexts. A particularly important aspect of these methods is their scalability, and hence the search for compositional methods that achieve scalability through separation of concerns. Most CaberNet teams involved in rigorous design technologies seek to influence industrial design practice. Approaches include influencing standardisation bodies and providing better formal underpinnings for existing design notations and practices. Many teams are currently working to influence design in UML, which could serve as a route for disseminating rigorous approaches.
Tool Definition and Development: The more rigorous methods can be supported by formal tools, the more likely they are to be adopted, and the more likely they are to be applied correctly in practice. There is a plethora of formal methods tools in Europe and the US. We need to define core toolsets for particular classes of application and to enable inter-working across tools. In particular, we need to develop tools that aid the development of dependable information infrastructures and that move many more rigorous approaches towards practical use.
Fault tolerance
Fault-tolerance through self-healing and self-protection constitutes a major technology for building systems that can be depended on, for both life-critical and business-critical applications. With respect to the openness, mobility, adaptability, and workability aspects of AmI, we can identify the following research challenges.
Openness. Fault-tolerance should be able to provide solutions to the dependability challenges of open networked environments, which include: risks to availability and even safety due to complexity, hidden interdependencies and malicious attacks; compromises to confidentiality and privacy due to intrusions and insider malfeasance. In particular, we need to investigate:
Innovative protocols for solving basic distributed system problems (e.g., information dissemination, consensus, election, group membership…) in the presence of uncertain delays, accidental faults and malicious attacks.
Scalable fault-tolerance algorithms that can ensure availability and integrity of computation and data in highly asynchronous and uncertain environments (e.g., for peer-to-peer systems); a minimal quorum-replication sketch of this idea appears after this list.
New distributed architectures capable of detecting and/or tolerating malicious intrusions with a view to attaining acceptable availability and confidentiality under attack.
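To make the flavour of such algorithms concrete, the following minimal sketch (illustrative only; all names are hypothetical, and a real system must also handle concurrency, retries and Byzantine behaviour) shows majority-quorum replication: because any two majorities intersect, a read returns the latest written value even when a minority of replicas has crashed or been partitioned away.

```python
import random

# Minimal sketch of majority-quorum replication: with N replicas,
# any write majority and any read majority intersect, so a read
# always observes the most recent successfully written value.

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None
        self.up = True  # a crashed or partitioned replica is marked down

    def write(self, version, value):
        if not self.up:
            return False
        if version > self.version:
            self.version, self.value = version, value
        return True

    def read(self):
        return (self.version, self.value) if self.up else None

class QuorumStore:
    def __init__(self, n=5):
        self.replicas = [Replica() for _ in range(n)]
        self.majority = n // 2 + 1
        self.version = 0

    def write(self, value):
        self.version += 1
        acks = sum(r.write(self.version, value) for r in self.replicas)
        if acks < self.majority:
            raise RuntimeError("write failed: no majority reachable")

    def read(self):
        replies = [rep for rep in (r.read() for r in self.replicas)
                   if rep is not None]
        if len(replies) < self.majority:
            raise RuntimeError("read failed: no majority reachable")
        return max(replies)[1]  # value with the highest version wins

store = QuorumStore(n=5)
store.write("x=1")
for r in random.sample(store.replicas, 2):
    r.up = False  # two replicas fail; a majority of three remains
assert store.read() == "x=1"
```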
Mobility. Several distinctive features of mobile systems offer significant fault-tolerance challenges:
Range limitation, obstacles to signal propagation, power source depletion, voluntary disconnections… all mean that network partitioning is common rather than exceptional, so there is a need to revisit existing fault-tolerance paradigms and consider not only asynchrony, but also extreme models with permanent partitioning in which liveness properties may be difficult or even impossible to define. The migration from connection-centric to data-centric interaction should favour data replication and peer-to-peer tolerance paradigms (a sketch of data-centric replica reconciliation appears after this list).
Compared to fixed networks, there are limited power sources, higher error rates and lower bandwidths, which mean that new fault-tolerance paradigms need to be investigated with delicate trade-offs between computation and communication.
The susceptibility of wireless communication to eavesdropping, the absence of guaranteed connectivity to trusted third parties for authentication and authorization, and the susceptibility of unattended ubiquitous devices to tampering all mean that security issues are of utmost importance. The intrusion tolerance paradigm in such a setting offers an interesting research direction.
Location-awareness, while opening interesting application perspectives, also raises frightening privacy issues that again might find solutions in tolerance paradigms based on ephemeral pseudonyms.
New cooperative peer-to-peer models need to be defined and studied from a tolerance perspective, including, for example, distributed trust models based on reputation and mutual confidence, micro-economy models for service-sharing, etc.
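As a concrete illustration of the data-centric tolerance paradigms mentioned above (a minimal sketch with hypothetical names; a production system would add merge policies and vector pruning), version vectors allow replicas that were updated during a partition to decide, on reconnection, whether one copy supersedes the other or the two have diverged and require reconciliation:

```python
# Illustrative sketch: version vectors for data-centric replication.
# Each mobile node updates its local copy while partitioned; on
# reconnection, comparing version vectors tells whether one copy
# strictly dominates the other (safe to adopt) or the two have
# diverged concurrently (a conflict needing reconciliation).

class Copy:
    def __init__(self, node_id):
        self.node_id = node_id
        self.value = None
        self.vv = {}  # version vector: node_id -> update count

    def update(self, value):
        self.value = value
        self.vv[self.node_id] = self.vv.get(self.node_id, 0) + 1

def dominates(a, b):
    """True if version vector a has seen every update b has seen."""
    return all(a.get(n, 0) >= c for n, c in b.items())

def sync(x, y):
    if dominates(x.vv, y.vv):
        y.value, y.vv = x.value, dict(x.vv)   # x supersedes y
    elif dominates(y.vv, x.vv):
        x.value, x.vv = y.value, dict(y.vv)   # y supersedes x
    else:
        # Concurrent updates during a partition: flag a conflict
        # for an application-level (or user-level) merge.
        raise RuntimeError(f"conflict: {x.value!r} vs {y.value!r}")

a, b = Copy("phone"), Copy("laptop")
a.update("draft v1")
sync(a, b)             # the laptop adopts the phone's newer copy
a.update("draft v2")   # both sides now update while partitioned
b.update("draft v2b")
try:
    sync(a, b)         # concurrent edits: must be reconciled
except RuntimeError as e:
    print(e)
```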
Adaptability. Self-healing and self-protection are of course forms of adaptation, in that they aim to modify the structure or the control of a system in order to react to the discovery of faults. However, challenging topics linked specifically with adaptation can be identified:
Adaptation of fault-tolerance techniques to the current level of threat (e.g., adaptive replication, adaptive encryption…) and the means to carry out such adaptation (e.g., reflective systems, aspect-oriented programming, exception handling…); a minimal sketch of adaptive replication appears after this list.
Use of adaptation techniques for self-optimisation and self-management of complex systems in the face of a wide variety of accidental and malicious threats.
Tolerance of deficiencies in adaptation algorithms, especially those based on heuristics or on learning.
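A minimal sketch of the first of these topics, adaptive replication, follows; the threat levels, thresholds and callbacks are hypothetical stand-ins for whatever threat-assessment and replica-management machinery a real system would provide.

```python
# Hypothetical sketch of adaptive replication: the replication
# degree tracks an assessed threat level, trading resource cost
# against tolerance of accidental faults and intrusions.

THREAT_TO_REPLICAS = {"low": 2, "elevated": 3, "severe": 5}

class AdaptiveReplicaGroup:
    def __init__(self, spawn, retire):
        self.spawn = spawn      # callback: create one replica
        self.retire = retire    # callback: remove one replica
        self.replicas = []

    def adapt(self, threat_level):
        target = THREAT_TO_REPLICAS[threat_level]
        while len(self.replicas) < target:
            self.replicas.append(self.spawn())
        while len(self.replicas) > target:
            self.retire(self.replicas.pop())

# Usage with trivial stand-in callbacks:
group = AdaptiveReplicaGroup(spawn=lambda: object(), retire=lambda r: None)
group.adapt("low")       # 2 replicas while the threat is low
group.adapt("severe")    # grow to 5 when intrusion is suspected
group.adapt("elevated")  # shrink back to 3 as the alert subsides
assert len(group.replicas) == 3
```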
Workability. There is also a need to consider techniques for tolerating classes of human-related faults that have hitherto received little attention:
Configuration faults: it is difficult to configure complex systems, especially in the face of changing environments and requirements. Such faults can lead to catastrophic failures [METT 1993] and are a major source of vulnerabilities that can be exploited for malicious attack [Whitten and Tygar 1999]. Techniques are needed for discovering and correcting such faults on-line, and for adapting other configuration elements accordingly.
Data faults: the quality and consistency of data are fundamental in critical data-intensive systems. We need techniques that can tolerate poor data through identification and correction procedures, or possibly through masking techniques that exploit data redundancy (a minimal majority-vote sketch follows this list). Human agents are seen as playing a privileged role here, since humans are often better than machines at identifying semantic discrepancies, possibly with the help of associated metadata.
Observation faults: inadequate system observability is a major source of operator mistakes (which can in turn lead to configuration and data faults). Hence there is a need to explore the tolerance paradigm as a complement to traditional prevention techniques, exploiting the capacity of humans to adapt to degraded levels of observability and to aggregate multiple sources of disparate information. These latter issues have been identified in the literature [Hollnagel 1987] and now constitute a strong basis for future research.
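To illustrate the masking technique suggested under data faults (a minimal sketch; the voting and escalation scheme is hypothetical), redundant copies of a data item can be read through a majority vote, with unresolved discrepancies escalated to a human agent:

```python
# Illustrative sketch: masking data faults through redundancy.
# A value recorded in several independent sources is read through
# a majority vote; if no majority exists, the discrepancy is
# escalated to a human agent, who is typically better than the
# machine at judging semantic plausibility.

from collections import Counter

def vote(readings, escalate):
    """Return the majority value, or escalate the discrepancy."""
    value, count = Counter(readings).most_common(1)[0]
    if count > len(readings) // 2:
        return value
    return escalate(readings)

def ask_operator(readings):
    # Stand-in for a human-in-the-loop resolution step.
    print(f"no majority among {readings}; operator decides")
    return readings[0]

# One corrupted copy is outvoted by the two good ones:
assert vote(["42.0", "42.0", "4.20"], ask_operator) == "42.0"
# Total disagreement is escalated rather than silently guessed:
vote(["a", "b", "c"], ask_operator)
```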