Fault Tolerance Techniques
Dr. J. Nandhini
Professor / ECE
Jai Shriram Engineering College
Tirupur
Fault Tolerance
• Fault tolerant – systems able to continue operating
despite the failure of a limited subset of their
hardware or software.
• Robust – designed to reduce the number of faults that
the system will encounter, by means of:
• Careful specification and design
• High grade components, to reduce manufacturing
defects
• Extensive system wide testing
• Short term response – quickly correcting for a failure
to allow immediate deadlines to be met.
• Long term response – locating the failure,
determining the best response to it and initiating a
recovery and reconfiguration procedure.
• Fault tolerance – the ability of a system to
maintain its functionality, even in the
presence of faults.
• Fault – defect or flaw that occurs in some
hardware or software component
• Error – manifestation of a fault
• Failure – departure of a system from the
service required.
Types of Fault
Faults can be classified as
1) Hardware fault
2) Software fault
Hardware Fault – a physical defect that
can cause a component to malfunction.
Eg: A broken wire, or the output of a logic
gate stuck at 0 or 1
Software Fault – a bug that can cause the
program to fail for a given set of inputs
• Fault Latency – duration between the onset
of a fault and its manifestation as an error.
• Error Recovery – the process by which the
system attempts to recover from the effects
of an error.
• There are 2 forms of error recovery
• Forward error recovery – the error is masked
without any computation having to be redone.
• Backward error recovery – the system is rolled
back to a moment in time before the error is
believed to have occurred and the
computation is carried out again. It uses time
redundancy, since it consumes additional time
to mask the effects of failure
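The rollback-and-recompute form of recovery described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `TransientError` exception, the `run_with_rollback` function and the dictionary-based state are all hypothetical names chosen for the example, and the sketch assumes that an error is detected and signalled by the step itself.

```python
import copy


class TransientError(Exception):
    """Hypothetical signal that an error was detected during a step."""


def run_with_rollback(state, step, max_retries=3):
    """Run step(state) in place; on a detected error, restore the
    checkpoint and carry out the computation again.

    Returns the number of rollbacks that were needed.
    """
    checkpoint = copy.deepcopy(state)  # state saved before the computation
    for attempt in range(max_retries + 1):
        try:
            step(state)
            return attempt
        except TransientError:
            # Roll back: discard partial updates and restore the saved state.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("error persisted after all retries")
```

Each retry costs extra execution time, which is why this kind of recovery is counted as time redundancy. A fault that never clears exhausts the retries and has to be handled by other means.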
Causes of Failure
• Errors in the specification or design
• Defects in the components
• Environmental effects
Errors in the specification or design
• Mistakes in the specification and design are
very difficult to guard against
• They can lead to both hardware and software failures
• It is very difficult to ensure the specification is
right
• Specifications can be reviewed by persons
not involved in writing them
Defects in the Components
• Hardware components can develop
defects
• These may arise from manufacturing
defects or from defects caused by the
wear and tear of use.
Environmental effects
• Poor ventilation or excessively high
ambient temperatures can melt
components or otherwise damage them
• If the computer is in a missile, it can undergo
high g-forces and vibrational stress
Fault types
• Faults are classified according to their
temporal behavior and output behavior
• Temporal behavior classification
1) Permanent
2) Intermittent
3) Transient
Temporal behavior
• Permanent fault: does not die away with
time but remains until it is repaired or the
affected unit is replaced
• Intermittent fault: cycles between the fault
active and fault benign states (eg: loose wires)
• Transient fault: dies away after some time.
Hard to catch.
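The three temporal classes can be told apart, at least tentatively, from a history of observations of whether the fault was active. The sketch below is only an illustrative heuristic: the function name `classify_fault` and the boolean-list representation are assumptions made for the example, and any finite observation window can misjudge a fault (an intermittent fault that has not yet recurred looks transient).

```python
def classify_fault(observations):
    """Guess the temporal class of a fault.

    observations: list of booleans in time order, True where the fault
    was active, starting at the moment the fault was first seen.
    """
    if not observations or not observations[0]:
        raise ValueError("expected a history that starts with the fault active")
    if all(observations):
        return "permanent"      # never went away during the window
    first_clear = observations.index(False)
    if any(observations[first_clear:]):
        return "intermittent"   # went benign, then became active again
    return "transient"          # died away and did not return
```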
Output behavior
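• A fault is characterized by the nature of the
errors that it generates.
• Two categories of output behavior:
• Non malicious – the faulty unit's output is seen
the same way by every unit that receives it
• Malicious – the faulty unit can behave
inconsistently, eg: sending different values to
different receivers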
Fault detection
• Two ways to determine that a processor is
malfunctioning
i) Online detection
ii) Offline detection
• Online detection: goes on in parallel with
normal system operation
• One way of doing this is to check for any
behavior that is inconsistent with correct
operation.
• Eg: branching to an invalid destination
• Fetching an illegal opcode
• Remaining inactive for more than a prescribed period
• Watchdog timers and cross-checking among
multiple processors
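The "inactive for more than a prescribed period" check is commonly implemented with a watchdog timer. The sketch below is a software model under stated assumptions: the `Watchdog` class and its method names are hypothetical, and timestamps are passed in explicitly so the logic can be followed without real clocks. In a real system the watchdog is usually a hardware timer that resets the processor.

```python
class Watchdog:
    """Flags a monitored processor that has been silent for too long."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout   # prescribed period of allowed inactivity
        self.last_kick = now     # time of the most recent sign of life

    def kick(self, now):
        """Called by the monitored processor to show it is still running."""
        self.last_kick = now

    def expired(self, now):
        """True if no kick has arrived within the timeout."""
        return now - self.last_kick > self.timeout


wd = Watchdog(timeout=5.0, now=0.0)
wd.kick(now=4.0)
print(wd.expired(now=8.0))   # False: last kick was 4.0 time units ago
print(wd.expired(now=9.5))   # True: 5.5 time units of silence
```

Because the check runs alongside the application, it counts as online detection: the monitored processor only has to issue a periodic kick.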
• Offline detection
• Consists of running diagnostic tests. When
a processor is running such a test, it
cannot be executing the application
software.
• Diagnostic tests can be scheduled just like
ordinary tasks.
Redundancy
• Fault tolerance consists of using and
properly managing redundancy
• Hardware redundancy
• Software redundancy
• Time redundancy
• Information redundancy
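Two of these forms are easy to illustrate in a few lines. The sketch below shows majority voting over replicated outputs (as used with redundant hardware units) and a single even-parity bit (a simple form of information redundancy). The function names are hypothetical, and both techniques are standard examples chosen for illustration rather than methods taken from these slides.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def add_even_parity(bits):
    """Append one bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0
```

With three replicas, `majority_vote([7, 7, 9])` masks the single disagreeing unit and returns 7. A single parity bit detects any odd number of flipped bits but cannot say which bit is wrong, so it detects errors without correcting them.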
