Page 88 - Mobile Computing
Fault Tolerance
Fault tolerance is a system's ability to continue operating without
interruption when one or more of its components fail. This holds whether the
system is a computer system, a cloud cluster, a network, or something else. In
other words, fault tolerance describes how an operating system (OS) responds to,
and allows for, software or hardware malfunctions and failures.
An OS's ability to tolerate and recover from faults without failing can be
provided by hardware, software, or a combined solution that leverages load
balancers (see more below). Some computer systems use multiple duplicate
fault tolerant systems to handle faults gracefully. This is called a fault
tolerant network.
[Figure: Load Balancer, Web Application, Web Server, Standby Web Server]
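The load balancer and standby server arrangement described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `Server` and `LoadBalancer` classes, the `healthy` flag, and the server names are all hypothetical, and a failure is simulated by raising `ConnectionError` rather than by a real network fault.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    """Hypothetical backend; `handler` stands in for real request handling."""
    name: str
    handler: Callable[[str], str]
    healthy: bool = True  # assumption: a simple flag simulates a failure

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.handler(request)


class LoadBalancer:
    """Tries servers in priority order: primary first, then the standby."""

    def __init__(self, servers: List[Server]) -> None:
        self.servers = servers

    def route(self, request: str) -> str:
        for server in self.servers:
            try:
                return server.handle(request)
            except ConnectionError:
                continue  # fail over to the next server in the list
        raise RuntimeError("all servers failed")


primary = Server("web-server", lambda r: f"web-server served {r}")
standby = Server("standby-web-server", lambda r: f"standby-web-server served {r}")
balancer = LoadBalancer([primary, standby])

first = balancer.route("GET /")   # handled by the primary
primary.healthy = False           # simulate a failure of the primary
second = balancer.route("GET /")  # handled by the standby instead
```

From the caller's side, both requests succeed; only the server that answers changes. The sketch deliberately omits health checks, retries, and timeouts, which a real load balancer would need.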
What is Fault Tolerance?
The goal of fault tolerant computer systems is to ensure business continuity and
high availability by preventing disruptions arising from a single point of failure.
Fault tolerance solutions therefore tend to focus most on mission-critical
applications or systems.
Fault tolerant computing may include several levels of tolerance:
- At the lowest level, the ability to respond to a power failure, for example.
- A step up: during a system failure, the ability to use a backup system
  immediately.