Analyzing failure to prevent problems

By Mohammed Hamed Ahmed Soliman

Executive Summary
Failure mode and effects analysis (FMEA) was initiated by the aerospace industry in the 1960s to improve the reliability of systems. It is a part of total quality management programs and should be used to prevent potential failures that could affect safety, production, cost or customer satisfaction. FMEA can be used during the design, service or manufacturing processes to minimize the risk of failure, improving the customer’s confidence while also reducing costs.

One ISO requirement is to have a method or system capable of controlling the process that determines the acceptability of product or service quality. FMEA is a good tool for improving the reliability of the product and its lifecycle. The tool can maximize the mean time between failures by reducing the probability of failure, which extends the lifecycle of the product. This can be done during the design phase, the manufacturing phase or maintenance service.

FMEA is a risk management tool that is designed to work as a preventive method rather than a corrective one.

Organizations that use the tool as a corrective method will find that it does not work as intended. FMEA can be quite useful for design engineers in the design phase of the product, and it helps research and development engineers develop new products with better reliability, quality and safety. FMEA helps manufacturing engineers control the process and eliminate errors during production, thus decreasing warranty costs and waste. Service engineers can use FMEA to improve the lifecycle of the product and lower its service costs by developing a proper maintenance program.

A team and a process

FMEA is not a job for one individual. The best possible results come when teams are composed of contributors from different engineering perspectives. The team should have four to six members. Team size is determined by the number of areas affected by the FMEA, such as manufacturing, maintenance, design, engineering, material and technical service.

The customer adds another unique perspective and should be considered for team membership. If customers cannot be included, the team should devise ways to generate voice-of-the-customer data.

The team should have a leader who acts as a facilitator, not a decision-maker. The team leader’s main goals are to ensure that all resources are available, coordinate the meetings, and make sure the team moves toward completing the FMEA process.

Brainstorming is a well-known technique for generating a large number of ideas in a short time period. It’s preferable to use this tool at the start of an FMEA process to determine potential failure modes for each component your team is studying. Brainstorming also helps find the root causes of each failure mode.

To encourage ideas, no theory should be critiqued or commented on when it is first offered. Each idea should be listed and numbered, exactly as offered, on a flip chart. Expect to generate at least 50 to 60 concepts in a 30-minute brainstorming session. Brainstorming sessions should follow four general rules: Do not comment on, judge or critique ideas at the time they are offered; encourage creative and offbeat ideas; the goal is to end up with a large number of ideas; and evaluate ideas later.

FMEA has sequential steps that were summarized in the book Basics of FMEA, Second Edition, by Raymond J. Mikulak, Robin McDermott and Michael Beauregard. Some of the steps are obvious, but others aren’t. A basic outline follows.

1. Select a high-risk process. This will depend on the criticality of the process and how a failure in this process can affect safety, environment, health, production or costs. For example, a generator that will supply electricity to a firefighting system during emergencies is a critical safety component and must be considered during an FMEA because failures in such situations cannot be accepted.

2. Review the process. This step involves assembling a team that includes people with various job responsibilities and levels of experience, such as the design engineers, maintenance engineers, production engineers, process engineers, safety engineers and environmental engineers. The purpose of the FMEA team is to bring a variety of perspectives and experiences to the project.

If the process is a manufacturing process, then the team should review the process flowcharts and walk through the process at the gemba (the place where the work is done) to observe the real situation and collect all the data needed.

If the subject is a product or machine, then the team should review the assembly drawing. The product should be tested, and every team member should be able to operate it and see how it works.

In this step, everyone in the team must have full knowledge of how the process works and operates.

3. Break down the system into components and subcomponents. If the system is large, like a water system that supplies an industrial process, the pump can be a critical component inside the system. A motor pump is a critical subcomponent because its failure can break down the entire process. The motor pump should be broken down into more subcomponents that are likely to fail and will affect the system, such as the motor’s bearings and the rotor shaft. The FMEA will be used to reduce the probability of failure for each component or subcomponent.

4. Brainstorm potential failure modes. Once everyone on the team has a deep understanding of how the process or product works, the team can start thinking about things that could happen to affect the process. After a brainstorming session, organize the ideas by grouping them into categories. Failure modes can be categorized in many different ways, including by failure type (e.g., electrical, mechanical or user-created).

A failure mode is an event that causes a functional failure, any of the myriad ways in which a product or process can fail. Examples of failure modes abound. Low discharge pressure could be a compressor failure mode. Knocking could be an engine failure mode. Seized bearings are a bearing failure mode. Burnout is a motor failure mode. A dead battery is a car battery failure mode.

Note that failures are not limited to problems with the product; they also can be tied to user mistakes. Those types of failures should be included in the FMEA. Anything that can be done to ensure the product works correctly, regardless of how the user operates it, will move the product closer to 100 percent total customer satisfaction. Mistake-proofing, also known by its Japanese term poka-yoke, can be a good tool for preventing failures related to user mistakes.

For example, an FMEA involving a coffee maker could try to engineer out the user mistake of putting too much or too little ground coffee in the filter. This will ensure that the machine is making the right coffee with the same quality of taste for all users.

5. Assign an effect for each failure mode. Each failure mode should have an effect that determines the severity of the failure. The effect is also known as the consequence of failure.

The effect of a failure mode on the system is influenced by the availability of standby or redundancy in the system. For example, a transformer that supplies electricity is critical, but the existence of a standby generator will reduce the criticality of the system. However, this performance must be considered and compared. If the transformer failed, would the generator be able to supply the electricity needed with the same efficiency? What is the time interval between when the transformer fails and when the generator starts to work? Will any failures have a severe effect on the product, the process or the whole system that will cost a lot of money to repair?

One failure mode could have several effects. For example, an electrical cutoff in the home could stop the refrigerator and damage food or prevent you from doing work on the computer.

Several failure modes could have one effect. A dead car battery or tire failure has the same effect on your vehicle – it will be difficult to make it to work on time with such a failure early in the morning.

The team must determine the end-effect each failure mode has on the system or the process. This means examining how each failure affects the entire system, the facility or the other connected processes.

6. Assign severity rankings. Severity, occurrence and detection are each ranked on a 10-point scale, ranging from one as the lowest ranking to 10 as the highest. Figure 1 shows a standard example of rankings for all three. In the severity category, potential safety, health and environmental failure modes generally indicate high risk, with rankings of nine and 10. Production losses and costs rank from a low of two to a high of eight, depending upon the length of potential delays and the severity of their effects on the entire system.

7. Assign an occurrence ranking for each failure mode. Occurrence is the probability of failure during the product’s expected lifecycle, usually determined using the failure log history. But when historical data are not available or the failure never has occurred before, the team can determine the causes of each failure mode with techniques such as the “five whys.” Once the potential causes are determined, the team can estimate an occurrence ranking.

8. Assign a detection ranking for each failure mode. First, the current control and prevention methods applied to prevent, detect or control the failure should be listed, reviewed and evaluated. The detection ranking should be assigned for each failure mode or effect based on the current control/prevention/detection methods. As with the severity and occurrence rankings, the detection ranking table in Figure 1 is standard. If one failure mode or effect has several causes, detection and occurrence rankings should be assigned based on these causes. When potential causes are eliminated, the risk of failure is lowered.

9. Calculate the risk priority number. The risk priority number (RPN) gauges the risk associated with potential problems identified during the FMEA process. It is useful for assessing risk and comparing components to determine priorities. The RPN is calculated by multiplying the severity, occurrence and detection rankings for each failure mode or effect. The number can serve as a baseline to compare with the revised RPN once the FMEA process is completed and risk is lowered.

Many have commented that the “ideal” tables in Figure 1 do not exactly match their industry type or current conditions. But remember that the ideal is only a guide, and the tables can be adapted and changed as needed. However, it is important to keep the rankings from one to 10 so that the RPN scale has a minimum score of one and a maximum score of 1,000.

10. Prioritize failure modes to take action. This could be done with something like a Pareto chart and the 80-20 rule. Failure modes should be prioritized according to the risk priority number. The highest RPNs should be given attention first; then you can pay attention to the severity rankings. Thus, if several failure modes have the same risk priority number, the failure mode with the highest severity should be given more priority.

All RPNs above a certain cutoff point should be considered for improvement. The cutoff should be low enough that addressing the failure modes above it can reduce the total risk priority number by at least 50 percent.

11. Take action to eliminate or reduce the high risk failure modes. Once the priorities are assigned, organize action through continuous improvement tasks and problem-solving approaches, implementing countermeasures to reduce or eliminate the high-risk failure modes.

Often, the easiest way to make an improvement to the product or process is to increase the detectability of the failure, which lowers the detection ranking. Teams can improve the chances of detecting failure through modifying the preventive maintenance program, using a proper condition-monitoring method, eliminating the failure mode during the manufacturing process by changing materials or suppliers, or considering a mistake-proofing method during the design phase. An example would be computer software that automatically warns that you are running out of memory.

12. Calculate risk priority number as high risks are removed. After corrective actions have been taken to lower risks, recalculate the RPN. You can compare this revised RPN with the earlier number to gauge improvement. The expectation is that the FMEA approach will reduce the initial RPN by at least 50 percent.

There always will be a potential for failure modes to occur. The question the company must ask is how much relative risk the team is willing to take. That answer might depend on the type of industry and the seriousness of the failure. For example, the nuclear industry has little margin for errors, as minor problems could escalate into major disasters. Other industries might find it acceptable to take higher risks.
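The arithmetic in steps 9 and 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, the rankings are invented for the example, and the cutoff of 100 is arbitrary. It multiplies the three rankings into an RPN, sorts by RPN with severity as the tie-breaker, and reports how much of the total RPN sits above a chosen cutoff.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (lowest) to 10 (highest)
    occurrence: int  # 1 to 10
    detection: int   # 1 to 10

    def __post_init__(self) -> None:
        # Keep every ranking on the 1-10 scale so the RPN stays within 1 to 1,000.
        for label in ("severity", "occurrence", "detection"):
            value = getattr(self, label)
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        # Step 9: RPN = severity x occurrence x detection.
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    # Step 10: highest RPN first; equal RPNs are ordered by higher severity.
    return sorted(modes, key=lambda m: (m.rpn, m.severity), reverse=True)


def above_cutoff(modes, cutoff):
    # Failure modes whose RPN exceeds the cutoff are candidates for action.
    return [m for m in prioritize(modes) if m.rpn > cutoff]


# Hypothetical rankings, invented only for illustration.
modes = [
    FailureMode("seized bearing", severity=6, occurrence=4, detection=5),
    FailureMode("motor burnout", severity=8, occurrence=3, detection=5),
    FailureMode("low discharge pressure", severity=4, occurrence=2, detection=2),
]

ranked = prioritize(modes)
for m in ranked:
    print(m.name, m.rpn)

selected = above_cutoff(modes, cutoff=100)  # arbitrary cutoff for the example
total = sum(m.rpn for m in modes)
share = 100 * sum(m.rpn for m in selected) / total
print(f"{share:.0f} percent of the total RPN is above the cutoff")
```

In this invented data set, the first two failure modes share the same RPN, so the one with the higher severity is listed first. The share printed at the end is one way to check whether a candidate cutoff captures enough of the total risk to make a 50 percent reduction plausible.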

A case of reliable improvement

A good example of a successful FMEA process comes from the case of a system that supplied electricity to a glass-melting furnace in Egypt. The electric transformer is considered critical because a failure causes high production losses – $5,000 an hour. A standby generator could keep the furnace running if the transformer failed. The standby was sufficient to avoid damaging the furnace but did not supply enough electricity to continue production.

The team broke down the transformer into seven components: bushing, tank, core, winding, oil, tap changer and solid isolation. Each component has different failure modes. For each failure mode there is an effect. And for each failure mode and effect there are several causes. Figure 2 shows the seven components and their failure modes, effects, causes, RPNs, rankings, recommended actions and other details.
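The nesting described above (component, then failure mode, then effect, then causes) maps naturally onto a small data structure. The following Python sketch is an assumption about how such a worksheet might be organized, not a reproduction of Figure 2: the class name and the placeholder strings are hypothetical, and only the seven component names come from the case.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FailureModeEntry:
    failure_mode: str
    effect: str                                       # the effect assigned to this failure mode
    causes: List[str] = field(default_factory=list)   # several causes per failure mode and effect


# The seven transformer components named in the case.
COMPONENTS = ["bushing", "tank", "core", "winding", "oil", "tap changer", "solid isolation"]

# One worksheet section per component, each starting with no entries.
worksheet: Dict[str, List[FailureModeEntry]] = {name: [] for name in COMPONENTS}

# Placeholder entry: these strings are illustrative and are not taken from Figure 2.
worksheet["oil"].append(
    FailureModeEntry(
        failure_mode="<failure mode>",
        effect="<effect>",
        causes=["<cause 1>", "<cause 2>"],
    )
)

for component, entries in worksheet.items():
    print(component, len(entries))
```

Severity, occurrence and detection rankings would be attached at the level where the text assigns them: severity to the effect, and occurrence and detection to the causes.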

The severity ranking number was based on the effect of each failure mode. Most of the failures had a medium effect on production because standby was available. An occurrence ranking was assigned based on the potential causes of each failure mode and the historical data.

It is important to discover the problem’s root cause first because the cause will help determine the occurrence ranking. A detection ranking was assigned based on an evaluation of the transformer’s current preventive maintenance program.

The transformer’s maintenance program contained basic measurements and analysis on a monthly and annual basis. No advanced prediction methods were used to detect severe problems that might occur during the system’s operation.

A risk priority number was calculated, with a cutoff RPN of 16. All RPNs greater than 16 were considered for improvement. The FMEA calculated a total RPN of 540. Applying continuous improvement actions to all RPNs greater than 16 lowered the total RPN to 188. This revised RPN was a 65 percent improvement. The reduction percentage is calculated using this formula: reduction (percent) = (initial RPN - revised RPN) / initial RPN × 100.
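As a quick check of the figures above, the reduction formula can be applied to the two totals reported for the transformer. The function name below is hypothetical; the inputs are the totals given in the case.

```python
def rpn_reduction_percent(initial_rpn: float, revised_rpn: float) -> float:
    # (initial RPN - revised RPN) / initial RPN * 100, as given in the text.
    if initial_rpn <= 0:
        raise ValueError("initial RPN must be positive")
    return (initial_rpn - revised_rpn) / initial_rpn * 100


# Totals reported for the transformer case: 540 before, 188 after.
reduction = rpn_reduction_percent(540, 188)
print(f"{reduction:.1f} percent")  # prints "65.2 percent"
```

The result rounds to the 65 percent improvement cited for the case, which exceeds the 50 percent reduction expected of an FMEA.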

The improvements that yielded success included using ultrasound to detect issues, increasing the frequency of oil sampling and using infrared analysis to detect mechanical damage.

An FMEA process can trigger a number of such actions to improve a product’s service or maintenance processes. They include, but are not limited to: Increase the detection rate of high-risk failures using a proper technique to monitor conditions; increase the inspection rate for a specific component or part; modify the routine maintenance program; increase the frequency of replacing a specific spare part; modify the preventive maintenance schedule; change a spare part supplier; redesign a specific part in the system – or redesign the whole system; and use different types of materials or spare parts.

While the above example involves a piece of equipment and its parts, FMEA can be applied in many other areas, including the component proving process; the outsourcing or resourcing of a product; developing suppliers to achieve quality; major changes in processes, equipment or technology; cost reductions; and analysis of new products or designs.

Other important considerations

Failure mode and effects analysis can maximize a product’s reliability. But don’t mistake it for a standalone tool. For example, to determine occurrence ratings, FMEAs rely on the failure log history, so the documentation process also is important. Problem-solving techniques like “five whys,” brainstorming, fault-tree analysis and Pareto analysis must be engaged. These techniques will help determine potential failure modes; assign the severity, occurrence and detection rankings; and provide solutions or actions to eliminate those failures.

And it cannot be emphasized enough how important customers are for a successful FMEA. A proper FMEA process must consider not only failures related to your organization’s quality, but also failures and mistakes that can be introduced by your customers. Collecting feedback is important. For example, Toyota’s recalls in recent years relied on its dealers and service centers to play a big role in collecting the important data needed to let Toyota know what changes needed to be made on its factory floors. The data was based on customer feedback and comments.

And, as industrial engineers and managers know, the best tools will not work without an inherent culture of continuous improvement. Everything runs a risk of failure. When a failure happens, the important thing is to find out what the organization can do to prevent similar failures from occurring in the future.

An FMEA is not a one-time job – it should be repeated regularly to keep improving the process. Once the quality and cost of your company’s offerings have been improved, competitors will try to match or exceed your value proposition. Continual FMEAs will bring your processes closer to perfection, so the continuous improvement culture should be embedded throughout all levels and with all employees.