Reliability centered maintenance (RCM) is the ongoing, systematic process of matching critical systems with the most cost-effective maintenance strategy to maximize overall reliability. Basically, it’s finding the type of maintenance that gives you the biggest bang for your maintenance buck for each of your most important assets.
Why do you need different asset maintenance strategies for different assets?
There’s no such thing as a one-size-fits-all solution in maintenance management, and unfortunately, it’s because there are so many ways to fail. In fact, when it comes to failing, every asset can have its own ways, reasons, and consequences, which means you need different strategies to avoid and deal with them.
What is reliability centered maintenance (RCM)?
Reliability centered maintenance is the process of finding the best possible maintenance strategy for every asset in your organization. The guiding principle is that different assets require different styles of maintenance management.
Some demand continuous high-tech monitoring, while others are best left to the run-to-failure model. It’s up to you to decide which one goes where.
Although reactive maintenance often has a bad name, there are times when it’s the best way for you to meet your organization’s business goals. The classic example is light bulbs, which almost always have the lowest level of criticality.
They are cheap to buy and carry in inventory. When they fail, there’s little to no safety risk and you’re not running the risk of lowered productivity. And even the most inexperienced tech can replace them.
Implementing a schedule of even the most basic inspections and tasks for light bulbs, then, is very likely a waste of your limited time and money, two resources always in short supply.
The process of finding the best strategy begins with looking at your history of breakdowns and the steps you’ve been taking to maintain and repair your assets. From there, you choose the best maintenance strategy.
The end goal of reliability centered maintenance is achieving consistently high levels of reliability at the lowest possible costs. Again, the biggest bang for your maintenance buck.
In a recent Eptura webinar on RCM, facility management consultant and IFMA Fellow Michel Theriault explains, “Reliability centered maintenance is not just about reducing failure of critical things. It’s actually about using their correct and most efficient maintenance strategy to reduce failure. So, that is the part we are most interested in in facilities, because let’s face it, we don’t have all the resources we would like to have.”
RCM and the connection to the concept of continuous improvement
If you’re familiar with the concept of continuous improvement from Lean, a lot of RCM likely sounds familiar. You use both to focus resources on the activities that deliver the most benefit while “removing waste.”
But Lean is a complex and involved system. For example, it has a fully fleshed-out set of seven categories for waste. You can think of RCM as continuous improvement specifically for maintenance, with the one goal of increasing reliability.
Which brings us to an important question.
RCM and the concept of a preventive maintenance program
It’s not really either/or when you’re thinking about RCM and preventive maintenance. In fact, a reliability centered maintenance analysis is one way to get a great preventive maintenance program.
Remember, preventive maintenance is where you use a combination of inspections and tasks to find and fix small issues before they have a chance to grow into large, expensive problems.
At first, when an asset is new, it often makes the most sense to follow the manufacturer’s suggestions for how and what you do. They designed and built the asset, so they know it best. But over time, you are the one with the better understanding of the asset’s maintenance and repair history.
And you also have a better understanding of where that asset fits inside your overall organization. Combined, that means eventually it’s up to you to set not only the inspections and tasks but also their timing.
Theriault explains, “After the warranty period, of course. But RCM is about ignoring what the manufacturer tells you, and considering your own individual, specific circumstances.”
What is the “reliability” in reliability centered maintenance?
Google “reliability” and you quickly learn it’s “the probability that a component or system will perform a required function for a given time when used under stated operating conditions.”
Although we’re working with a popular definition of reliability, it’s also one of the longer ones. Let’s pull it apart and take a closer look at each section.
the probability that a component or system
Reliability is a measurement of asset and equipment performance. And you can also use it to measure the performance of specific parts inside those assets and pieces of equipment. So, for example, you can measure reliability for an entire car or just the small motor that drives the pump for the window washer fluid.
Reliability is connected to probability, which means it’s a way of answering the question “What are the chances?”
will perform a required function
On the most basic level, here we’re just talking about the asset or equipment working the way we want it to work. It’s not failing, and that covers both complete and partial failures.
Complete failure is when your car is sitting at the side of the road with a flat tire. Partial failure is when the engine timing is off, affecting performance. Although we tend to worry more about complete failures, partial failures can be even more of a problem.
With a complete failure, you know there’s a problem and can take steps right away to fix it. But with a partial failure, you might not even know anything is wrong, leading to even more damage or, for example on a production line, a lot of low-quality parts.
for a given time
When we calculate reliability, we need to factor in time.
In fact, time is a huge part of reliability. One way to think about reliability is how long you can trust an asset to run before it fails.
And a basic formula for calculating reliability is the hours of operation divided by the number of failures. If that looks familiar, it’s because it’s the basic formula for mean time between failures (MTBF).
When we’re looking at something’s reliability over time, the formula becomes more complicated, but time is still a big part of it.
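As a rough illustration of these two calculations, the sketch below computes MTBF from operating hours and failure count, then estimates the chance of a failure-free run of a given length. The second step assumes a constant failure rate (the exponential model, where reliability at time t is e^(-t/MTBF)). That is a common simplification rather than something this article prescribes, and the function names and pump figures are made up for the example.

```python
import math

def mtbf(operating_hours: float, failures: int) -> float:
    """Mean time between failures: hours of operation divided by number of failures."""
    if failures <= 0:
        raise ValueError("need at least one recorded failure to compute MTBF")
    return operating_hours / failures

def reliability(hours: float, mtbf_hours: float) -> float:
    """Probability of running `hours` without failure, ASSUMING a constant failure rate."""
    return math.exp(-hours / mtbf_hours)

# Hypothetical pump: 2,000 operating hours and 4 failures on record
pump_mtbf = mtbf(2000, 4)
print(pump_mtbf)                               # 500.0
print(round(reliability(100, pump_mtbf), 3))   # 0.819
```

Under these assumptions, a pump with a 500-hour MTBF has roughly an 82% chance of getting through a 100-hour run without failing. Real assets often wear out rather than fail at a constant rate, so treat the second number as a first approximation.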
when used under stated operating conditions
You could calculate reliability for that car based on driving it to and from work five days a week for a month. But what if you drove the same route the same number of times in a month but now at ten times the average speed? Your first set of calculations no longer applies. As soon as you change the operating conditions, you have to go back and redo your reliability calculations.
And that can be a huge issue for your maintenance department if the front office decides to improve their ROI on assets by running them longer and harder. So, that reliable pump that’s only ever been cycling a thousand times a day is going to start failing a lot more often when it’s pushed to two thousand cycles a day after the front office adds a second shift to the production schedule.
Can you keep it up and running with the same level of reliability? Of course, but you first need to rethink the maintenance program you’re running for that asset, including the frequencies for all your inspections and tasks.
What is the difference between reliability and availability?
That’s likely the more common question, but it might be more helpful to ask, “What is the connection between reliability and availability?”
The connection is that availability is a function of reliability, and the two have a direct relationship. Because reliability can be part of the formula you use to calculate availability, when you increase reliability, you also increase availability.
In fact, you can think of reliability as sugar and availability as a glass of water. How sweet is your availability? The more reliability you add, the sweeter your availability.
That’s the relationship between the two. But getting back to the original and more common question, what are the differences?
Just like with reliability, availability is also related to probability. It’s a way of answering the question “What are the chances?” But here you’re looking at a different “chances of what.”
Availability is the probability an asset or piece of equipment is available when you need it. So, you take the actual working time and divide it by the scheduled working time. Then take that result and multiply it by 100 to get your availability as a percentage.
Someone scheduled an asset to be running for 100 hours. At the end of the 100 hours, though, you know the asset was up for 89 hours. Eighty-nine divided by 100 is 0.89, and that multiplied by 100 is 89%. In a perfect world, you’d see 100 hours of uptime for every 100 hours of scheduled time. But in the real world, you can shoot for 90%.
When trying to establish your uptime, remember to subtract both scheduled and unscheduled downtime. If an asset fails halfway through a production run, and it takes the maintenance techs an hour to get it back online, that pulls the availability percentage down.
But when that same asset is offline for an hour because you scheduled a set of inspections and maintenance tasks as part of a preventive maintenance program, the effect on the availability percentage is the same. Downtime, for any reason, has a negative effect on your availability.
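The availability calculation described above can be sketched in a few lines. The function and parameter names here are hypothetical, and the only rule encoded is the one from the text: uptime is scheduled time minus all downtime, planned or not.

```python
def availability_pct(scheduled_hours: float,
                     unscheduled_downtime: float = 0.0,
                     scheduled_downtime: float = 0.0) -> float:
    """Actual working time divided by scheduled working time, as a percentage."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    # Both kinds of downtime reduce uptime in exactly the same way
    uptime = scheduled_hours - unscheduled_downtime - scheduled_downtime
    return uptime / scheduled_hours * 100

# The example from the text: 100 scheduled hours, 89 hours of uptime
print(availability_pct(100, unscheduled_downtime=11))

# An hour lost to a breakdown and an hour lost to a PM have the same effect
print(availability_pct(100, unscheduled_downtime=1) ==
      availability_pct(100, scheduled_downtime=1))
```

Keeping the two downtime figures as separate inputs makes it easy to see how much availability you could win back just by moving PMs outside of production time.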
The good news is that you now have two ways to improve availability. The first is by scheduling preventive maintenance inspections and tasks so they don’t interrupt production. Instead, you can schedule PMs for when you already know the asset is set to be offline. For example, in between shifts or production runs.
The other way is by increasing reliability. The more reliable an asset, the higher the chances it’s online when you need it.
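To see how reliability feeds into availability, one widely used formula is inherent availability: MTBF divided by the sum of MTBF and mean time to repair (MTTR). This formula is not given in the article, so take the sketch below as one possible way to make the relationship concrete. The three-hour repair time and the MTBF values are invented for illustration.

```python
def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Inherent availability = MTBF / (MTBF + MTTR), returned as a fraction."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hold repair time at 3 hours and watch availability climb as MTBF improves
for mtbf_hours in (100, 200, 400):
    pct = inherent_availability(mtbf_hours, mttr_hours=3) * 100
    print(f"MTBF {mtbf_hours} h -> availability {pct:.1f}%")
```

With repair time held constant, every increase in MTBF pushes availability higher, which is the direct relationship described earlier.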
What is a reliability centered maintenance analysis?
Reliability centered maintenance started in the aviation industry, which is unsurprising given the numerous parts and components that comprise aviation equipment, their heavy use, and the risks and potentially catastrophic consequences of aviation equipment failures.
Over time, organizations across industries have implemented the minimum criteria set out for RCM methods in technical standard SAE JA1011 — Evaluation Criteria for Reliability-Centered Maintenance (RCM) Processes.
The process involves working through a set of seven questions.
What is the asset or equipment supposed to do, and what are the associated performance standards?
Here, you’re trying to identify the functions of the system or equipment. In other words, you need to know how the equipment performs and its ability to meet company needs within the parameters of environmental safety and government standards.
You can find this information in the manufacturer’s documentation. You want to know the scope of the functions as well as their limitations and methods of use relating to safety and environmental measures.
For example, an industrial scale may have a weight limit. As soon as you exceed it, that scale starts becoming inaccurate or stops functioning. The documentation also explains how to use the asset to ensure both safety and accuracy. There could be instructions on how to place or handle the items you want to weigh and where to keep the scale.
What does this asset do, how much of that is it doing currently, and how much of that would I like it to be doing?
For example, you have a conveyor belt that moves boxes. Currently, it’s moving 5000 boxes between breakdowns, and each of those breakdowns lasts about three hours. Based on a combination of what the belt’s manufacturer says, what your maintenance team says, and data in your CMMS or EAM software, you think you can get that number up to 7000 boxes between breakdowns. You can also reduce each breakdown from three to two hours.
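A quick way to compare the current and target performance of the conveyor belt is to work out the repair hours each implies for the same workload. The sketch below uses the figures from the example. The `Performance` class and the 35,000-box planning volume are hypothetical, with the volume chosen only because it divides evenly by both intervals.

```python
from dataclasses import dataclass

@dataclass
class Performance:
    boxes_between_breakdowns: int
    hours_per_breakdown: float

    def downtime_hours(self, total_boxes: int) -> float:
        """Expected repair hours while moving `total_boxes` boxes."""
        breakdowns = total_boxes / self.boxes_between_breakdowns
        return breakdowns * self.hours_per_breakdown

current = Performance(boxes_between_breakdowns=5000, hours_per_breakdown=3)
target = Performance(boxes_between_breakdowns=7000, hours_per_breakdown=2)

volume = 35_000  # hypothetical planning volume
print(current.downtime_hours(volume))  # 21.0
print(target.downtime_hours(volume))   # 10.0
```

For this workload, hitting both targets would cut expected repair time from 21 hours to 10, which gives you a concrete number to weigh against the cost of getting there.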
In what ways can equipment fail to provide the required functions?
Simply put, this means being able to identify failure modes in a piece of equipment. In other words, it involves determining the nature of the equipment failure. For example, does the failure relate to one part or is it a systemic failure?
The key is to identify exactly how a piece of equipment has failed, how often, and if it involves the same equipment part. In companies with several pieces of the same types of equipment, it is important to determine if a particular failure is occurring systematically on all pieces or if the failure is limited to only one piece.
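If your work orders are in a CMMS, a simple grouping can show whether a failure mode appears on every identical asset or only on one. The following is a minimal sketch with made-up records and a hypothetical function name. It labels a failure mode as systemic only when every asset in the fleet has shown it, which is a deliberately strict rule you may want to relax.

```python
from collections import defaultdict

# Hypothetical work-order records: (asset_id, failure_mode)
records = [
    ("pump-1", "seal leak"),
    ("pump-2", "seal leak"),
    ("pump-3", "seal leak"),
    ("pump-2", "bearing wear"),
    ("pump-2", "bearing wear"),
]

def classify_failure_modes(records, fleet_size):
    """Label a failure mode 'systemic' if every asset in the fleet has shown it, else 'isolated'."""
    assets_by_mode = defaultdict(set)
    for asset_id, mode in records:
        assets_by_mode[mode].add(asset_id)
    return {
        mode: "systemic" if len(assets) == fleet_size else "isolated"
        for mode, assets in assets_by_mode.items()
    }

print(classify_failure_modes(records, fleet_size=3))
# {'seal leak': 'systemic', 'bearing wear': 'isolated'}
```

Here the seal leak shows up on all three pumps, which points to a design or operating issue, while the repeated bearing wear is limited to one pump and is more likely a local problem.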
What are the events that cause each failure?
Closely related to finding equipment failure modes is identifying the causes of the failures. It’s important to determine why, when, and how equipment failures typically happen. This is particularly true of heavy-use equipment, which could suffer from operating fatigue. Also, you need to know when equipment is most likely to fail and the nature of the failure.
For example, you might run a water pump continuously, and at some point, the equipment starts to fatigue from the constant use. Another common type of equipment stress leading to failure is exposure to harsh environmental conditions such as heat, cold, or moisture.
There is also human error as well as inherent design or manufacturing flaws that cause equipment failure. Finding out the cause of the failure is important to understanding how to prevent or minimize it.
What happens when each failure occurs?
To improve your operations, you need to do more than just identify equipment failures. You also need to know their effects, which can range from nearly undetectable to complete losses of function. For example, a failing piece of equipment might lead to a decrease in output speed or quality. Or it might smoke, stutter, and seize.
In the end, all forms of equipment failure impact productivity, operations, and capital costs. They also lead to unplanned disruptions in production and expensive repairs you wish you had avoided.
In what way does each failure matter?
Here you’re looking for failure fallout. Apart from the financial and logistical consequences of equipment failure, you need to think about safety risks for operators as well as possible environmental impacts.
You also need to consider how a failure affects the integrity and condition of an asset overall.
What systematic task can I do proactively to prevent or diminish the consequence of failure?
The answer to this question is hiding inside the asset’s maintenance and repair history. By looking at who did what and when they did it, you can start to see breakdown patterns. Once you have the pattern, you can start to slot in proactive preventive measures between breakdowns.
For example, the conveyor belt generally runs fine for about 5000 boxes before requiring some sort of repairs. If you add visual inspections after every 4500 boxes, you have a good chance of stretching out your uptime.
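One simple way to turn a breakdown pattern into an inspection point is to take the average run between breakdowns and schedule the inspection slightly before it. The sketch below does that with an invented history and a hypothetical 0.9 safety factor, which reproduces the 4500-box figure from the example. It is a starting heuristic, not a substitute for a proper failure analysis.

```python
from statistics import mean

def inspection_interval(boxes_between_breakdowns, safety_factor=0.9):
    """Suggest an inspection point as a fraction of the average run between breakdowns."""
    if not boxes_between_breakdowns:
        raise ValueError("need at least one recorded run")
    return round(mean(boxes_between_breakdowns) * safety_factor)

# Hypothetical history: boxes moved before each of the last five breakdowns
history = [4800, 5100, 5000, 5200, 4900]
print(inspection_interval(history))  # 4500
```

If the runs between breakdowns vary widely, an average hides a lot, so you may want a lower safety factor or an inspection tied to the shortest runs instead.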
But be careful; the wording of this question can be a bit misleading. It’s about what you can do, but you also have to consider what you should do. There are situations when you should take steps to avoid breakdowns. But there are also situations where it’s going to be better to simply continue to use the run-to-failure maintenance strategy.
When the cost and trouble of avoiding breakdown are more than the value of the increased uptime, it makes more sense just to let things run until they fail. Back to the classic example, think light bulbs.
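The comparison above boils down to a single check, sketched here with hypothetical annual figures. In practice the hard part is estimating the two inputs, and safety or environmental consequences may override a purely financial answer.

```python
def prefer_run_to_failure(annual_prevention_cost: float,
                          annual_failure_cost_avoided: float) -> bool:
    """True when preventing breakdowns costs more than the uptime it buys."""
    return annual_prevention_cost > annual_failure_cost_avoided

# Light bulbs: inspections would cost far more than the failures they prevent
print(prefer_run_to_failure(annual_prevention_cost=500,
                            annual_failure_cost_avoided=50))      # True

# Conveyor belt: modest inspections avoid expensive production stoppages
print(prefer_run_to_failure(annual_prevention_cost=1200,
                            annual_failure_cost_avoided=15000))   # False
```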
What should I do if I can’t find a suitable preventive task?
Here we’re dealing with a very specific situation: the best maintenance strategy is not run-to-failure, but at the same time we can’t find a good proactive preventive maintenance plan to apply.
Imagine you have an old A/C unit in your machine shop. In fact, it’s so old that you can’t source parts for it anymore. And it runs on a coolant that used to be common but is now in the process of being phased out through legislation. You can’t maintain it by refilling the coolant and you can’t repair it by switching in new parts.
Because you can’t set up a maintenance strategy, all you can do is have a plan in place for when the A/C inevitably dies. That might mean having money already set aside in the budget to buy a replacement. It might mean borrowing a unit from another department’s inventory.
There’s no perfect answer, but you want a solution that can be implemented quickly, with the least amount of disruption.
As you work through the seven questions, you find the best possible maintenance strategy for each asset. It’s important to remember that answers can change over time. Any given asset can shift in criticality, and the costs associated with different maintenance strategies can increase or drop due to many factors, both internal and external.
What are the steps to reliability centered maintenance implementation?
Organizations need to begin by looking at their assets in terms of criticality. Basically, you should ask, “How bad is it if this asset fails?” Then start to look at other factors, such as costs for maintenance and labor, risk of injury, environmental damage, lost productivity, and compliance-related fines. Once you’ve determined criticality, rank your assets from most to least critical.
Always remember that two or more identical assets can have different levels of criticality depending on where you’re using them. In the webinar on RCM, Michel Theriault explains that “An AC unit on the roof that serves a server room and one that serves an office… you would deal with those 100% differently.”
The rooftop AC units are the same, but the costs to an organization of hot employees versus overheated servers are different.
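A criticality ranking can start as a simple scoring exercise. The sketch below is one possible approach: each asset gets a 1 to 5 score on four of the factors mentioned above, and the scores are added with equal weight. The factor list, the equal weighting, and every score are assumptions for illustration, and your organization may well weight safety or compliance more heavily.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    safety: int         # each factor scored 1 (low impact) to 5 (high impact)
    productivity: int
    environmental: int
    compliance: int

    @property
    def criticality(self) -> int:
        # Equal weights are an assumption; adjust to match your priorities
        return self.safety + self.productivity + self.environmental + self.compliance

assets = [
    Asset("rooftop AC (office)", safety=1, productivity=2, environmental=1, compliance=1),
    Asset("light bulbs", safety=1, productivity=1, environmental=1, compliance=1),
    Asset("rooftop AC (server room)", safety=2, productivity=5, environmental=1, compliance=3),
]

# Rank from most to least critical
ranked = sorted(assets, key=lambda a: a.criticality, reverse=True)
for asset in ranked:
    print(asset.name, asset.criticality)
```

Note that the two identical AC units land in very different places on the list because the scores describe where the asset is used, not what the asset is.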
Once you have your ordered list, starting from the top, use the seven RCM questions on each asset. Based on the answers, you can determine the best maintenance strategy for each asset.
Crucially, RCM is an ongoing process. Organizations need to periodically revisit earlier decisions, ensuring that their maintenance strategies change as business goals, asset criticality, and failure histories evolve.
For example, the best maintenance strategy for an asset early in its useful life is different from the one that’s the best fit 15 years later. And even though predictive maintenance or condition-based maintenance did not make economic sense for an asset five years ago, it might be the best choice after the price of sensors has dropped.
When asked in the webinar when was the right time to think about implementing RCM, Theriault says, “If you’ve always been using the same maintenance scheduling and task list, it’s absolutely time to rethink that.”
But one element that always remains the same is the need to sell people across departments on the value of the program. Theriault explains, “You could be the smartest technician, the smartest manager, the smartest director, whatever your role is, you’re going to get nowhere if you can’t convince somebody else, higher up, to give you money, to give you resources to make it happen. Fundamentally, that’s the challenge we have, as I see it, going forward in our industry.”
Reliability centered maintenance is a process for finding the best maintenance strategy for each of your assets so that they deliver maximum value for the least amount of time, money, and effort, three critical resources always in short supply.
For example, it makes more sense to use run-to-failure for light bulbs. But for a forklift, you would want to use preventive maintenance. The process of determining the best strategy for each asset is an RCM analysis, which involves asking a set of seven questions.
For each asset, you use the questions to determine the function, failure modes, and how best to avoid them. From there, you decide on a tailored program to increase reliability.
It is important to remember that identical assets may require different maintenance strategies if they have different levels of criticality. For example, an AC unit that cools the breakroom in an office has a much lower criticality than the identical AC unit that’s set up to cool a server room running critical business software.