2.2 Composite vs Reactive

The following analysis evaluates the advantages and disadvantages of the two proposed fail-safe system topologies:

  1. Composite Fail-Safe:

    Pros

    1. Inherent Architectural Fail-Safety and Redundancy: When multiple branches handle the same safety function, a fault on one path lets the system redistribute stress across the remaining paths and continue operating safely, albeit in controlled degradation. This comes with a caveat: at least three parallel instances are needed to provide, besides voting, the required redundancy (NooM, i.e., N-out-of-M topologies where N < M).
    2. Predictable and Gradual Failure: Because the system remains operational, dangerous failures occur gradually rather than abruptly. This gradual failure allows a remote manual or automatic shutdown once the system is detected to be operating in degraded conditions. This scenario applies when at least three redundant paths (replicas) are used and the voting requires only two replicas to agree: if one replica is damaged, the system can continue operating as 2oo2.
    3. Parallel Instances Run the Same Software: Since the branches are identical, the same processing software can run on all subsystems, and the redundancy and decision logic can ultimately be implemented in hardware.
    4. Robustness to Multiple Failures: The distributed, redundant system can withstand multiple consecutive failures, with the ability to detect those failures and place the system in a safe state.
    5. Permissive-Mode Mismatch Minimization: Once a discrepancy between the paths' processing results occurs, the system can promptly decide whether to continue in a reduced operating mode, if a sufficient number of branches remain healthy, or to shut down to a safe state.

    Cons

    1. Design Complexity: Modeling the system is complex, and validating it can be very challenging. The cross-interactions between the processing paths require extensive simulation and testing to confirm the viability of the topology, since the number of possible failure modes increases with system complexity.
    2. Lack of Diversity: To simplify the development process, all branches are made identical. This approach results in running identical software replicas, which makes the system prone to systematic failures.
    3. Complexity of Synchronization: The system may run complex software with unpredictable events triggered by external factors. This calls for complex circuitry that compares the outputs of the processing branches at defined moments in time. Synchronization slips must be well managed to avoid mismatches in the voter's decision. Additionally, if the insulation between branches does not allow a sufficient degree of independence (which raises, to some degree, the prospect of common-mode failure), a time-delay technique must be used for temporal diversification, further complicating the decision logic.
    4. Increased Cost Due to Over-Engineering: Redundancy adds cost, increases system size and weight, and incurs performance penalties during normal operation, since the need for synchronization slows the response.
    5. Resource Intensive: The additional circuitry required to manage redundancy complicates system testing and increases maintenance cost.
    6. Lack of Modularization: Typically, the design is tailored around the safety function that needs to be accomplished, making it more difficult to reuse parts of the design.
  2. Reactive Fail-Safe:

    Pros

    1. No Performance Penalties under Normal Operating Conditions: Safety measures are activated only when required. Reactive subsystems monitor the main subsystem's health and indirectly oversee its operating conditions. The main subsystem, which executes the safety function, runs only a small additional workload (crosschecks) alongside its main operation.
    2. Lower Overhead in Nominal Conditions: The reactive side requires fewer resources, as it primarily monitors the main subsystem's operating condition and reacts only if a fault is detected. This results in an asymmetrical system design with less hardware-block replication, making it more flexible and cost-effective.
    3. Adaptability: The reactive side can easily be adjusted to detect other possible malfunctions that were not initially planned for in the system.
    4. Hardware Diversity: Because the processing-power demand differs between the main and reactive subsystems, the architecture can use different hardware for the main and reactive processing paths. This reduces the chances of systematic failure, as the two subsystems share no commonality in operation.
    5. No Synchronization Required: Reactive subsystems need not operate in sync with the main subsystems, owing to their differing roles.

    Cons

    1. Dependence on Detection and Control Reliability: The safety strategy relies on the accuracy of sensors and control algorithms. On the reactive side, because the measurements are indirect, hardware degradation can impact overall system functionality, potentially allowing dangerous conditions to go unnoticed. Moreover, the additional circuitry required to check operational conditions may increase total system cost.
    2. Delay in Response / Temporary Permissive State: Even when fault detection remains reliable, there is always a finite response time between detection of a fault and activation of the safe state, during which a permissive state is temporarily allowed. This may be unacceptable for systems with critical timing requirements for switching to a safe state.

By analyzing both approaches, it becomes apparent that for the Composite Fail-Safe, pros 1 to 4 hold only in voting configurations that allow redundancy, such as 2oo3, 2oo4, etc. However, due to the complexity of those configurations, multi-redundancy approaches are not frequently adopted; industry prefers multiple systems in parallel, e.g., 2 x 2oo2, which can more easily be justified as SIL 4-compliant.
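As a concrete illustration of such a voting configuration, the sketch below shows a minimal 2oo3 majority voter; all names, types, and the degraded-mode flag are illustrative assumptions, not part of the proposal.

```c
#include <stdint.h>

/* Minimal 2oo3 majority voter (illustrative names and types).
 * The voted output is the value on which at least two of the three
 * identical branches agree; a single disagreeing branch is flagged so
 * the system can degrade to 2oo2 operation or be shut down. */
typedef struct {
    uint8_t voted_output; /* majority value, meaningful only if valid */
    uint8_t degraded;     /* 1 if exactly one branch disagreed        */
    uint8_t valid;        /* 0 if no two branches agree: go safe      */
} vote_result_t;

static vote_result_t vote_2oo3(uint8_t a, uint8_t b, uint8_t c)
{
    vote_result_t r = {0u, 0u, 0u};
    if (a == b || a == c) {
        r.voted_output = a;
        r.valid = 1u;
        r.degraded = (uint8_t)!(a == b && a == c);
    } else if (b == c) {   /* branch a is the outlier */
        r.voted_output = b;
        r.valid = 1u;
        r.degraded = 1u;
    }                      /* else: total disagreement, valid stays 0 */
    return r;
}
```

In a real design this logic would typically live in hardware or in a certified voter block; the point here is only that the decision rule itself is simple, which is what makes graceful degradation tractable.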

Additionally, achieving time synchronization on the voting side is challenging and becomes increasingly complex when critical timing is required. A significant advantage is that, owing to the simplicity of the voting mechanism, once an error appears the output systems can either enter a safe state or continue operating in controlled degradation until the error is resolved.
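One common way to manage the synchronization problem is to compare branch outputs only when their timestamps fall within an agreed tolerance window, so that small clock skew between branches is not misreported as a mismatch. The sketch below illustrates this; the window width and all names are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative windowed comparison of two branch outputs.
 * Samples taken too far apart in time are not compared at all,
 * preventing benign skew from being flagged as a voting mismatch. */
#define SYNC_WINDOW_US 100  /* assumed maximum tolerated skew, in us */

typedef enum { CMP_MATCH, CMP_MISMATCH, CMP_NOT_SYNCED } cmp_result_t;

static cmp_result_t compare_windowed(uint32_t out_a, int64_t t_a_us,
                                     uint32_t out_b, int64_t t_b_us)
{
    if (llabs(t_a_us - t_b_us) > SYNC_WINDOW_US)
        return CMP_NOT_SYNCED;   /* samples too far apart to compare */
    return (out_a == out_b) ? CMP_MATCH : CMP_MISMATCH;
}
```

A `CMP_NOT_SYNCED` result would feed the synchronization management logic rather than the voter, keeping missed alignments separate from genuine output discrepancies.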

The major disadvantages include a lack of diversity, which makes the system susceptible to systematic failures; difficulty in achieving modularity, as the system must be tailored to a specific set of safety functions; and the increased cost of over-engineering due to the necessity for crosschecks and synchronization.

Alternatively, the reactive approach ensures modularity and diversity. However, from the moment an error is detected until it is mitigated, the output remains in a permissive state. For most applications this is not a problem because:

  1. When the load is disconnected, the detection of an internal fault is inconsequential, as the load is already in a safe state.
  2. During operation, once the fault is detected, if the main system has irreversibly failed, the reactive side places the system in a safe state within the Process Safety Time (PST).
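The second point can be sketched as a small reactive state machine: once a fault is observed, the monitor either sees it cleared or forces the safe state before the PST budget expires. The tick-based timing and all names below are illustrative assumptions, not the proposal's implementation.

```c
#include <stdint.h>

/* Illustrative reactive monitor. The main subsystem's health indicator
 * is sampled once per monitor tick; after a fault is observed, the
 * permissive state is tolerated only until the assumed PST budget runs
 * out, at which point the safe state is latched. */
typedef enum { SYS_OPERATING, SYS_PERMISSIVE, SYS_SAFE } sys_state_t;

#define PST_TICKS 5u  /* assumed PST budget, in monitor ticks */

typedef struct {
    sys_state_t state;
    uint32_t ticks_since_fault;
} monitor_t;

/* Called once per monitor tick with the latest health-check results. */
static void monitor_step(monitor_t *m, int fault_detected, int fault_cleared)
{
    switch (m->state) {
    case SYS_OPERATING:
        if (fault_detected) {            /* permissive window opens */
            m->state = SYS_PERMISSIVE;
            m->ticks_since_fault = 0u;
        }
        break;
    case SYS_PERMISSIVE:
        if (fault_cleared) {             /* main subsystem recovered */
            m->state = SYS_OPERATING;
        } else if (++m->ticks_since_fault >= PST_TICKS) {
            m->state = SYS_SAFE;         /* enforce safe state within PST */
        }
        break;
    case SYS_SAFE:
        break;                           /* latched until manual reset */
    }
}
```

The asymmetry discussed above shows here as well: this monitor is far simpler than the main subsystem it supervises, which is why it can run on modest, diverse hardware.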

Microchip offers two types of ASIL B-compliant MCU device variants: the fast dsPIC33A DSC, with powerful real-time processing capabilities, and the cost-effective SD AVR. Both devices benefit from the same development environment (MPLAB®), and different libraries developed by different teams ensure diversity. An e-Fuse case study is presented, followed by a generalization through the modularization of the reactive topology based on these MCU devices. This approach can be replicated in projects required to comply with ASIL C/D or SIL 3/4.

Note: IEC 61508 mandates redundancy (a two-channel implementation) for SIL 4 systems.

This proposal enables smooth and fast development of such products by benefiting from the associated safety packs, generalized Latent Failure Detection (LFD), cross-check libraries, and additionally required drivers.