Maintenance and Optimal Stopping

Imagine we have a machine in a factory that makes products which are sold to the general public. Ideally we want to make as many products as possible so we can sell them all to make a nice healthy profit. This means that we really want to be running the machine all through the day with no unnecessary stoppages.

Realistically however we know that machines do in fact breakdown and companies employ teams of people whose job it is to maintain the machines so that they can keep them running or repair them if they do break. Generally these maintenance teams will have a planned schedule of times when they will stop the machine in order to do their necessary checks and servicing.

This approach works well except for the fact that there are often conflicting views between the maintenance department and the production department. The production department wants to make sure that products are always being made (as few stops as possible), whereas the maintenance viewpoint is that the aim should be to stop the machine breaking down (needing a lot of stops). This poses the question of how to find the optimum balance between having a reliable machine and a productive machine?

There is also a need to take into account the fact that all the components in a machine will a general degradation over time. If a component has been used for too long past its lifetime then the chance of it breaking will be very large. Since a break down is much more costly to a company than any maintenance activity, we want to perform additional maintenance based on the lifetime of the components rather than someone’s opinion.

This is better known as ‘just-in-timemaintenance, where we act on the machine just before it breaks down due to wear and tear. One way of doing this that causes minimal disruption to the planned schedule is to turn one of the scheduled maintenance activities into this just-in-time maintenance.

The aim is then to be able to determine the last possible stop during which this just-in-time maintenance can be performed, before the machine will break down due to general wear and tear. This links into the mathematics of optimal stopping theory.

Odds Algorithm

One clever way to solve this problem of optimal stopping is by using the Bruss or Odds-Algorithm. The key point about the Odds-Algorithm is that it has been shown to actually be optimal and it produces a win probability that is greater than or equal to the well known lower bound of 1/e = 36.8%.

The algorithm starts by looking at a series of n independent events and associates with it another set of variables $I_1, \dots, I_n$. These are binary and represent whether the event is something we would like to stop on (=1) or not (=0).

$\text{Prob of stopping on a success, } p_i = P(I_i = 1)$

$\text{Prob of not stopping on a success, } q_i = (1 - p_i)$

$\text{Odds of stopping on a success, } r_i = p_i/q_i$

The algorithm starts by summing up the odds in reverse order,

$r_n + r_{n-1} + r_{n-2} + \dots,$

until the sum is equal to or greater than 1 for the first time. If this happens at an index s, we save the number s and the sum above as $R_s$,

$R_s = r_n + r_{n-1} + r_{n-2} + r_s$

If for some reason the sum of the odds doesn’t get to 1, we take s=1. In addition we also compute,

$Q_s = q_n q_{n-1} \dots q_s$

Then the results of this algorithm are s, the stopping threshold and $Q_s$ the probability of meeting your aim.

In a more mathematical form the answer to the problem is given by,

$s:= \sup(1; \sup \{1 \leq k \leq n | \sum_{j=k}^n r_j \geq 1 \} )$

$\text{Optimal Reward} = (\prod_{j=s}^n q_j)(\sum_{j=s}^n r_j)$

A simple example might be the following. Suppose we roll are prepared to roll a dice 25 times, but we want to stop on a number 5. Then using the Odds-Algorithm we can find that,

$p_i =\frac{1}{6}, q_i = \frac{5}{6}, r_i=\frac{1}{5}$

Then we see that the required reverse sum to 1 is $r_{25} + r_{24} + \dots + r_{21} = 1$, so we find that s=21. Then the probability of actually ending on a 5 using this strategy is $(5/6)^5 \approx 40.19\%$.

Maintenance Example

It turns out that we can actually use the odds-algorithm to solve the maintenance problem we described earlier. In our case the n independent events are the times when production is stopped and the variables $I_1, ..., I_n$ represent whether the just-in-time maintenance can be performed at that particular stop.

The probability p_i of whether the just-in-time maintenance can be done might be represented through a reliability term and a maintenance term. This reliability function X of the machine depends on the starting time a_i of the production stop. The longer the production stop is into the operating cycle, the less likely that the machine has not yet broken down. The maintenance function M might will probably depend on the length of the stoppage of production d_i. The longer the production stop the more likely that the maintenance can happen during it, so an exponential distribution might be appropriate.

$p_i = X(a_i)M(d_i)$

$q_i = 1 - X(a_i)M(d_i)$

$r_i = \cfrac{X(a_i)M(d_i)}{1 - X(a_i)M(d_i)}$

Then substituting into the formula used by the odds algorithm we can find the best stop to do just-in-time maintenance at and the chance that this stop will turn out to have been optimal in hindsight.

$\text{Optimal Stop, } s:= \sup(1; \sup \{1 \leq k \leq n | \sum_{j=k}^n r_j \geq 1 \} )$

$\text{Optimal Reward} = \left(\prod_{j=s}^n (1 - X(a_j)M(d_j))\right) \left(\sum_{j=s}^n \cfrac{X(a_j)M(d_j)}{1 - X(a_j)M(d_j)}\right)$