1 Department of Photonics Engineering, Technical University of Denmark
2 Networks, Department of Photonics Engineering, Technical University of Denmark
3 Center for Bachelor of Engineering Studies, Technical University of Denmark
4 Department of Telecommunication, Technical University of Denmark
5 Copenhagen Center for Health Technology, Center, Technical University of Denmark
High availability is a key requirement in mobile communication systems, especially when they are used for mission-critical services such as public safety, e.g. police, ambulance and fire services. A failure in the fixed network infrastructure that provides services to mobile users can affect a large number of users and put lives at risk. The fixed infrastructure of a mobile communication system has characteristics, such as architectural complexity, real-time peer-to-peer communication and performance requirements, that make existing failure recovery techniques, such as those based on rollback or replication, inapplicable. This dissertation presents a novel failure recovery approach based on a behavioral model of the communication protocols. The new recovery method can deal with both software and hardware faults and is particularly suitable for mobile communication infrastructure. The method enables faulty applications in the infrastructure to quickly and effectively resume service to their mobile clients with no or minimal loss of work after a failure. Unlike many recovery techniques, our approach does not assume a specific fault behavior, such as fail-stop or transient behavior. In addition, the method does not require any modification to mobile clients. The Communicating Extended Finite State Machine (CEFSM) is used to model the behavior of the infrastructure applications. The model-based recovery scheme is integrated into the application and uses the client/server model to save the application state information to stable storage during failure-free execution and to retrieve it when needed during recovery. When to save or retrieve information, and which information, is determined by the behavioral model of the application. To evaluate and demonstrate the effectiveness of our method in practice, we developed, as a case study, an experimental testbed for the TETRA (TErrestrial Trunked Radio) packet data network.
The testbed works as a distributed system and can run various communication scenarios between the fixed network infrastructure and its mobile users. We closely followed the TETRA standard specifications in our implementation of the communication protocols in order to obtain a testbed that operates like the real system with respect to message exchange and timing. The experimental results showed that, with our method, the faulty infrastructure application can resume its service immediately after restart and restores its pre-failure service performance level in less than a minute. The failure-free overhead incurred by the method is relatively low; it was found to be less than 5% in the conducted experiments.
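
The idea of letting a behavioral model decide when and what state to save can be illustrated with a minimal sketch. It is not the dissertation's implementation: the states, the message names (`OPEN`, `DATA`, `CLOSE`), the `StableStorage` and `SessionMachine` classes, and the choice of a local JSON file as "stable storage" are all hypothetical and chosen only for illustration. Each transition of a small extended state machine carries a flag saying whether the state and context variables should be saved; after a restart, the machine resumes from the last saved snapshot.

```python
import json
import os
import tempfile


class StableStorage:
    """Hypothetical stand-in for a stable-storage server: one JSON file per session."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, session_id):
        return os.path.join(self.directory, f"{session_id}.json")

    def save(self, session_id, snapshot):
        # Write to a temporary file, then rename atomically, so a crash
        # during the write never leaves a half-written snapshot behind.
        tmp_path = self._path(session_id) + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self._path(session_id))

    def load(self, session_id):
        try:
            with open(self._path(session_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None


# Illustrative model: (state, input message) -> (next state, save snapshot?)
TRANSITIONS = {
    ("IDLE", "OPEN"): ("ACTIVE", True),
    ("ACTIVE", "DATA"): ("ACTIVE", False),
    ("ACTIVE", "CLOSE"): ("IDLE", True),
}


class SessionMachine:
    """Extended FSM for one client session; the model decides when to save."""

    def __init__(self, session_id, storage):
        self.session_id = session_id
        self.storage = storage
        self.state = "IDLE"
        self.context = {"packets": 0}  # extended-state (context) variables

    def handle(self, message):
        next_state, save = TRANSITIONS[(self.state, message)]
        if message == "DATA":
            self.context["packets"] += 1
        self.state = next_state
        if save:
            snapshot = {"state": self.state, "context": self.context}
            self.storage.save(self.session_id, snapshot)

    @classmethod
    def recover(cls, session_id, storage):
        # After a restart, resume from the last saved snapshot if one exists.
        machine = cls(session_id, storage)
        snapshot = storage.load(session_id)
        if snapshot is not None:
            machine.state = snapshot["state"]
            machine.context = snapshot["context"]
        return machine


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        storage = StableStorage(directory)
        machine = SessionMachine("client-1", storage)
        machine.handle("OPEN")  # snapshot saved on this transition
        machine.handle("DATA")  # not saved: lost if the process fails now
        restarted = SessionMachine.recover("client-1", storage)
        print(restarted.state, restarted.context)
```

In this sketch, the restarted machine resumes in the `ACTIVE` state without any action from the client, while the unsaved `DATA` count is lost. This mirrors, in a very simplified way, the trade-off between failure-free overhead and loss of work that the choice of save points controls.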