Implementing functionally safe RTD systems: Certification - Embedded.com

2022-10-09 04:14:54 By : Mr. Jack Shen

This second article in a two-part series continues the discussion from the first article of resistance temperature detector (RTD) circuit design for a functionally safe system, and covers the Route 2S component certification process. Certifying a system is a long process because every component in the system must be reviewed for potential failure mechanisms, and there are various methods to diagnose failures.

The AD7124-4/AD7124-8 are not SIL rated, meaning that they were not designed and developed following the development guidelines of the IEC 61508 standard. However, by understanding the end application and the use of the various diagnostics, one can assess the AD7124-4/AD7124-8 for use in a SIL rated design.

Let’s review some of the concepts important to the certification journey:

Systematic failures are deterministic (nonrandom) failures from a certain cause, which can be eliminated by a modification of the design or of the manufacturing process, operational procedures, documentation, or other relevant factors. For example, a noisy interrupt to the system happens due to a lack of filtering on the external interrupt pin.

On the other hand, random failures are due to physical causes, which apply to hardware components within a system. This type of fault is caused by effects such as corrosion, thermal stressing as well as wear-out and it is not possible to catch such failures by systematic processes.

To deal with random failures, we can use methods like reliability, diagnostics, and redundancy.

With reliability, we make sure that reliable components are used, while with diagnostics we make sure that failures can be detected and corrected. Redundancy lowers the probability of failure, but it increases the system cost and board space.

There are four types of random failures, which are safe detected, safe undetected, dangerous detected, and dangerous undetected.

For example, consider a system whose safety function is to open up a power switch for the machine when the temperature read is high. Any random failure that does not impact the safety function, that is, opening up the power switch, is termed a safe detected or a safe undetected failure. The other malfunctions impacting the safety function are dangerous failures. The most important one for us is the dangerous undetected failure. This failure type is the one not covered by diagnostics, so our goal is to increase the diagnostics to keep the dangerous undetected failures minimal.

Random failures can be detected by built-in detection mechanisms implemented in software or hardware. For example, a failure in a MOSFET switch can be detected by reading back the output, and a random memory bit flip can be detected by running CRC memory checks at regular intervals.
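The CRC memory check mentioned above can be sketched as follows. This is only an illustration: the memory image, the function names, and the choice of CRC-32 from Python's zlib module are assumptions, not the mechanism used inside the AD7124-4/AD7124-8.

```python
import zlib


def compute_crc(memory: bytes) -> int:
    """Return the CRC-32 of a memory image."""
    return zlib.crc32(memory)


def memory_is_intact(memory: bytes, reference_crc: int) -> bool:
    """Periodic check: recompute the CRC and compare it with the stored reference."""
    return compute_crc(memory) == reference_crc


# Hypothetical 32-byte configuration image and the reference CRC taken at start-up.
config_image = bytes([0x12, 0x34, 0x56, 0x78] * 8)
reference_crc = compute_crc(config_image)

# Simulate a random single-bit flip in byte 5.
corrupted = bytearray(config_image)
corrupted[5] ^= 0x04
corrupted_image = bytes(corrupted)

print(memory_is_intact(config_image, reference_crc))     # unchanged image passes
print(memory_is_intact(corrupted_image, reference_crc))  # flipped bit is detected
```

In a real system the check would run on a timer, and a mismatch would drive the system to its safe state.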

Diagnostic coverage is the ability of the system to detect dangerous failures, mathematically defined as the ratio of dangerous detected failures to dangerous failures.
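As a small worked example of this ratio, the sketch below computes diagnostic coverage from a dangerous detected and a dangerous undetected failure rate. The function name and the FIT values are illustrative assumptions, not data for any particular part.

```python
def diagnostic_coverage(lambda_dd: float, lambda_du: float) -> float:
    """Ratio of dangerous detected failures to all dangerous failures.

    lambda_dd: dangerous detected failure rate (for example, in FIT)
    lambda_du: dangerous undetected failure rate (same unit)
    """
    total_dangerous = lambda_dd + lambda_du
    if total_dangerous <= 0:
        raise ValueError("total dangerous failure rate must be positive")
    return lambda_dd / total_dangerous


# Illustrative numbers only: 90 FIT detected, 10 FIT undetected.
dc = diagnostic_coverage(90.0, 10.0)
print(f"DC = {dc:.0%}")
```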

Consider a programmable logic controller (PLC) system, such as the one shown in Figure 2, whose safety function is to open the switch in order to stop the machine if the input goes beyond a particular value. In the HFT = 0 figure, if there is a single random failure (X) then the system will malfunction and the machine will not stop. Now, if we have a redundant path as shown in the HFT = 1 figure, then a single random failure will no longer cause the failure and we will be able to stop the machine.

So, by adding a redundant path, a single failure can be tolerated. Such a system is called an HFT 1 system, meaning that one failure cannot cause the system to fail. HFT 0 means that one failure can cause the system to fail. Hardware fault tolerance is the ability of a component or subsystem to perform a safety function in the presence of one or more dangerous faults.

HFT can be calculated from architectures like 1oo1, 1oo2, 2oo3, etc. If the architecture is expressed as MooN, then the HFT is calculated as N – M. For example, a 2oo4 architecture has an HFT of 2. This means it can tolerate two failures and still work, and thus it is an architecture with redundancy.
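The N – M rule can be expressed as a short helper. The function name and the string parsing are assumptions made for illustration.

```python
import re


def hft_from_architecture(architecture: str) -> int:
    """Return the hardware fault tolerance of an 'MooN' architecture as N - M."""
    match = re.fullmatch(r"(\d+)oo(\d+)", architecture)
    if match is None:
        raise ValueError(f"not an MooN architecture: {architecture!r}")
    m, n = int(match.group(1)), int(match.group(2))
    if m < 1 or m > n:
        raise ValueError("M must be between 1 and N")
    return n - m


for arch in ("1oo1", "1oo2", "2oo3", "2oo4"):
    print(arch, "-> HFT", hft_from_architecture(arch))
```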

Table 1 plots safe failure fraction (SFF), which reflects the amount of diagnostic coverage, against hardware fault tolerance (meaning the redundancy).

Table 1. SIL Level Coverage

The rows show the amount of diagnostic coverage, whereas the columns show the hardware fault tolerance. HFT of 0 means that if there is one fault in the system, the safety function will be lost (see Table 1).

If we add redundancy achieving HFT 1 as shown in Figure 2, the system can tolerate one failure without the system going down. So, customers who achieve SIL 3 with redundancy today could achieve a SIL 3 rating without redundancy if they use a part with higher diagnostic coverage.

So, with a higher level of diagnostics we reduce the amount of system redundancy needed, or we improve the SIL level of the solution with the same level of redundancy (move down on Table 1).
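The trade-off between diagnostics and redundancy can be sketched as a lookup. The values below are an assumption: they follow the architectural constraints for Type B elements as commonly tabulated from IEC 61508-2, and should be checked against Table 1 and the standard itself before use. The function and variable names are hypothetical.

```python
from typing import Optional

# Assumed SFF bands (lower bound) mapped to the maximum SIL for HFT 0, 1, and 2.
# None means that no SIL claim is allowed for that combination.
_TYPE_B_BANDS = [
    (0.99, (3, 4, 4)),
    (0.90, (2, 3, 4)),
    (0.60, (1, 2, 3)),
    (0.00, (None, 1, 2)),
]


def max_sil(sff: float, hft: int) -> Optional[int]:
    """Return the maximum claimable SIL for a given SFF (0..1) and HFT (0..2)."""
    if not 0.0 <= sff <= 1.0:
        raise ValueError("SFF must be between 0 and 1")
    if hft not in (0, 1, 2):
        raise ValueError("HFT must be 0, 1, or 2")
    for lower_bound, sils in _TYPE_B_BANDS:
        if sff >= lower_bound:
            return sils[hft]
    return None


# Same SIL 3 target reached two ways: redundancy, or higher diagnostic coverage.
print(max_sil(0.95, 1))  # SFF in the 90% to 99% band with one redundant channel
print(max_sil(0.99, 0))  # SFF of at least 99% with no redundancy
```

Under these assumed values, both calls return 3, which matches the point made above that higher diagnostic coverage can replace a redundant channel.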

Now, let’s recall the diagnostics in the AD7124-4/AD7124-8, which support various built-in mechanisms like power supply/reference voltage/AIN monitoring, open wire detection, conversion/calibration checks, signal chain functionality check, read/write monitoring, register content monitoring, etc. that boost the diagnostic coverage of the AD7124-4/AD7124-8 system. In the absence of these diagnostics, two ADCs would be required to achieve the same desired level.

Hence, a single AD7124-4 or AD7124-8 provides the same level of coverage as two ADCs without these diagnostics, and its diagnostic features enable the design of a functionally safe system. This results in 50% savings in BOM and printed circuit board space.

Documentation to Support SIL Rated Designs

The documentation needed to support end-system SIL certification, each covered in this article, comprises the FMEDA, the pin FMEDA, the Annex F checklist, and the safety manual or data sheet.

These documents draw primarily on four sources of data, as shown in Figure 3: diagnostic data, design data, FIT rates, and data from fault insertion tests.

Figure 3. Functional safety documentation information flow.

The AD7124-4/AD7124-8 FMEDA analyzes the main blocks in the application schematic, identifies their failure modes and effects, and checks which diagnostics cover them for a particular safety function. Let’s look at Figure 4 to understand the mechanism. For an RTD type system, the safety function is to measure temperature with an accuracy of ±x degrees; the application schematic is shown in Figure 4.

Figure 4. An RTD application schematic diagram.

We define a dangerous fault as a fault that can lead to an error in the ADC output or SPI communication, and if the error in the output is significant, it can cause a dangerous failure.

Safe state is defined as:

The AD7124-4/AD7124-8 are identified as a Type B system according to IEC 61508. To explain the FMEDA, let’s take the example of the clock module and analyze its failure modes.

Table 2 shows what happens when the clock block faces the failure modes described in the first column, its effect on output, the amount of diagnostic coverage, and lastly the analysis.

Table 2. Master Clock Block Failure Mode, Effects, Diagnostics, and Analysis

Similarly, we then analyze the remaining blocks in the AD7124-4/AD7124-8.

Note that there may be some failures that may not impact the safety function; for example, the failure on the AIN0 pin will not cause problems for temperature measurement and, hence, can be excluded from the safety calculations.

The outcome of the FMEDA will be failure rates of safe failures, dangerous detected failures, and dangerous undetected failures, which are used to calculate the SFF.
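A minimal sketch of that last step is shown below: it takes the three failure rates produced by an FMEDA and computes the SFF as the fraction of failures that are either safe or dangerous but detected. The function name and the FIT values are illustrative assumptions, not AD7124-4/AD7124-8 data.

```python
def safe_failure_fraction(lambda_s: float, lambda_dd: float, lambda_du: float) -> float:
    """SFF = (safe + dangerous detected) / (safe + dangerous detected + dangerous undetected)."""
    total = lambda_s + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lambda_s + lambda_dd) / total


# Illustrative FMEDA totals in FIT: 50 safe, 45 dangerous detected, 5 dangerous undetected.
sff = safe_failure_fraction(50.0, 45.0, 5.0)
print(f"SFF = {sff:.0%}")
```

Only the dangerous undetected rate lowers the SFF, which is why adding diagnostics that move failures from undetected to detected raises it.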

The pin FMEDA analyzes various types of failures on the pins of the AD7124-4/AD7124-8 and their outcome for this RTD application. Step by step, we take every individual pin and analyze the outcome in case the pin opens up or shorts to supply/ground or shorts to adjacent pins.
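The per-pin walk described above can be sketched as a generator of failure cases to analyze. This is a hypothetical helper: the wording of the cases and the assumption that pin numbering wraps around the package (so the highest-numbered pin is adjacent to pin 1) are illustrative, and the effect of each case still has to be judged against the application schematic.

```python
def pin_failure_modes(pin: int, pin_count: int = 32) -> list[str]:
    """List the failure cases to analyze for one pin of a package with pin_count pins."""
    if not 1 <= pin <= pin_count:
        raise ValueError("pin number out of range")
    # Assumption: numbering wraps around, so pin 1 and the last pin are neighbors.
    previous_pin = pin_count if pin == 1 else pin - 1
    next_pin = 1 if pin == pin_count else pin + 1
    return [
        f"pin {pin} open",
        f"pin {pin} short to supply",
        f"pin {pin} short to ground",
        f"pin {pin} short to pin {previous_pin}",
        f"pin {pin} short to pin {next_pin}",
    ]


# Cases for Pin 29 on a 32-lead package.
for case in pin_failure_modes(29):
    print(case)
```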

Figure 5. 32-lead LFCSP pin configuration.

For example, let’s take Pin 29 (DIN) from Figure 5, refer to the application schematic shown in Figure 4, and check the outcome for different failures. Table 3 shows the failure mode, effects, and detection.

Table 3. Failure Mode, Effects, and Analysis for Pin DIN

Note that the analysis is with respect to the application schematic shown in Figure 4, so a failure on an unused pin has no impact on the safety function.

The Annex F checklist is a checklist of design measures for avoiding systematic failures in ASICs. A completed Annex F checklist from IEC 61508-2:2010 is needed for compliance.

Safety Manual or Data Sheet

An entire set of information finally flows into the safety manual or data sheet, which provides the necessary requirements to enable the integration of the AD7124-4/AD7124-8.

When showing compliance with the IEC 61508 functional safety standards, the safety data sheet collates all the diagnostics and analyses that flow in from various documents. It will have all the information such as:

Route 2S, Also Known as Proven in Use

We have discussed the first method for assessment. Now, let us discuss the alternate method known as proven in use or Route 2S. This method is applicable for a released part and is based on an analysis of customer returns and the number of devices shipped.

This allows SIL certification as if the part were fully developed as per the IEC 61508 standard.

Route 2S or a proven in use claim may be available to module/system designers if they have successfully used an IC in the past and know the failure rate from the field.

Note that Route 2S requires complete field-return data, which makes this claim much harder for integrated circuit designers or manufacturers, as they generally do not have enough knowledge of the final application or of what percentage of the units failing in the field are returned to them for analysis.
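To show why the return percentage matters, the sketch below makes a rough point estimate of a field failure rate in FIT (failures per 10^9 device-hours) from returns and shipments. It is a simplification under stated assumptions, not the assessment procedure required by IEC 61508: the function name, the average operating hours per device, and the return fraction are all hypothetical inputs.

```python
def field_fit_estimate(
    returns: int,
    devices_shipped: int,
    hours_per_device: float,
    return_fraction: float = 1.0,
) -> float:
    """Rough FIT point estimate from field data.

    returns: failed units returned for analysis
    devices_shipped: number of devices shipped
    hours_per_device: assumed average operating hours per device
    return_fraction: assumed share of field failures that are actually returned
    """
    if devices_shipped <= 0 or hours_per_device <= 0:
        raise ValueError("shipments and operating hours must be positive")
    if not 0.0 < return_fraction <= 1.0:
        raise ValueError("return_fraction must be in (0, 1]")
    estimated_failures = returns / return_fraction
    device_hours = devices_shipped * hours_per_device
    return estimated_failures / device_hours * 1e9


# Illustrative numbers only: halving the assumed return fraction doubles the estimate.
print(field_fit_estimate(5, 1_000_000, 10_000, return_fraction=1.0))
print(field_fit_estimate(5, 1_000_000, 10_000, return_fraction=0.5))
```

The estimate scales inversely with the assumed return fraction, so an unknown return percentage translates directly into an uncertain failure rate.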

The ADC and system requirements for an RTD measurement system are quite stringent. The analog signals generated by these sensors are small. These signals need to be amplified by a gain stage whose noise is low so that the amplifier’s noise does not swamp the signal from the sensor. Following the amplifier, a high resolution ADC is required so that the low level signal from the sensor can be converted into digital information.

Along with the ADC and gain stage, a temperature system requires other components such as excitation currents. Again, these must be low drift, low noise components so that the system accuracy is not degraded. Initial inaccuracies such as offset can be calibrated out of the system, but the drift of the components with temperature must be low to avoid introducing errors. So, integrating the excitation blocks and measurement blocks simplifies the customer design.

When designing for functional safety, there is the additional need for diagnostics. By integrating diagnostics along with the excitation and measurement blocks, the overall system design is eased, reducing the BOM, design time, and time to market.

Documentation such as FMEDAs contains all the information required by customers to certify the component in the end design. However, certifying the components themselves further eases the conversation with the certification house. The Route 2S process allows products to be certified after release, which makes it a useful route because many devices already released suit functionally safe designs.

Note: Figures and charts are courtesy of Analog Devices.
