First-time fix rate (FTFR) is a metric commonly used in the lift industry to describe the efficacy of an unscheduled service visit. Under one definition, a visit counts as a first-time fix if the fault does not recur within 30 days. Let’s imagine a scenario in which this might play out. Technician A is dispatched to the site of a breakdown. The lift is not running, and she can find no apparent reason for it being out of service. She cycles the isolator switch, and the lift starts to run. On the service ticket she reports: “Found lift out of service, adjusted door locks on floors one, two and three. Car returned to service”.
Two weeks later the lift has a second breakdown. Technician B is dispatched to the same site and he also cannot find the cause of the problem. He cycles the isolator switch, and the lift starts to run. On the service ticket, he reports: “Found lift out of service, adjusted gate switch and cleaned car door tracks. Car returned to service”.
Three weeks after the first breakdown, technician C is dispatched to the site to upgrade the controller software. The lift has not been reported out of service, and technician C finds it running properly. The technician installs the new software, and the lift is still functioning properly after 90 days.
Although the two breakdowns occurred within 30 days of each other, the reported resolutions were different, so both breakdowns were recorded as being fixed the first time. Because the software upgrade prevented further breakdowns, the most likely root cause was a software bug. In other words, neither breakdown was fixed the first time.
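The gap between the two ways of counting can be sketched in a few lines of Python. This is a minimal illustration, not any lift company's reporting logic: the `Ticket` record, its field names, the dates and the two counting rules are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WINDOW = timedelta(days=30)  # the 30-day definition quoted above


@dataclass(frozen=True)
class Ticket:
    """Hypothetical breakdown ticket; field names are illustrative."""
    unit: str
    opened: date
    resolution: str


def ftfr(tickets, same_fault):
    """Share of tickets with no matching repeat ticket within WINDOW.

    same_fault(a, b) decides whether ticket b counts as a repeat of a.
    """
    fixed = 0
    for t in tickets:
        repeat = any(
            same_fault(t, other)
            and t.opened < other.opened <= t.opened + WINDOW
            for other in tickets
        )
        if not repeat:
            fixed += 1
    return fixed / len(tickets)


# The two breakdowns from the scenario (dates are made up, 14 days apart).
tickets = [
    Ticket("lift-1", date(2023, 3, 1), "adjusted door locks"),
    Ticket("lift-1", date(2023, 3, 15), "adjusted gate switch"),
]

# Rule 1: a repeat needs the same unit AND the same reported resolution.
by_resolution = ftfr(
    tickets, lambda a, b: a.unit == b.unit and a.resolution == b.resolution
)

# Rule 2: any further breakdown on the same unit counts as a repeat.
by_unit = ftfr(tickets, lambda a, b: a.unit == b.unit)

print(by_resolution)  # 1.0
print(by_unit)        # 0.5
```

Even the unit-based rule credits the second visit, because the software upgrade prevented any further ticket. Neither figure captures the fact that cycling the isolator switch never addressed the root cause.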
An amazing percentage of calls can be fixed by powering down the system and powering it back up again. That doesn’t fix the problem; it hides it, but as the issue is temporarily resolved, the building management isn’t pressurising the lift engineer for a resolution. The truth is, not everybody is good at troubleshooting lift faults, although the skill can be improved with training.
Here is a second example. It is very common for service technicians dispatched to repair an out-of-service lift to find the lift running on arrival (ROA). The assumption is that there never was anything wrong with the lift and that the building management company failed to verify that the lift was truly out of service. However, those with experience in service operations have found that ROA calls usually involve a real problem that is intermittent or condition-based.
For example, if a door lock is misadjusted, the lift will function properly most of the time. But if passengers are distributed in the lift car in a particular way, the car may tilt slightly and prevent the door from locking properly. In this case the car will not run until the passengers exit and the car balance changes. With the change in car balance, the car then runs. The technician, unable to recreate the particular loading pattern and unaware of the real issue, reports an ROA event.
Common metrics such as FTFR and ROA cloud the operational picture, and could create confusion when implemented in machine learning. We mustn’t fool ourselves, and we mustn’t fool our customers.
PM AND THE CONFUSION MATRIX
A predictive maintenance (PM) system aims to find faults before they occur. Whether the system has a direct connection with the datastream of the lift control system, or monitors its condition through standalone sensors such as vibration sensors, the purpose of a PM system with machine learning is to predict that a lift will fail in the near future and send a technician to the site to fix the lift before it fails.
When a prediction is made that a lift will fail, there are two possible outcomes: a true positive (TP), in which it was predicted that the lift would fail and it did, or a false positive (FP), in which it was predicted that the lift would fail and it did not. Conversely, when it is predicted that a lift will not fail, there are also two outcomes: a true negative (TN), in which it was predicted that the lift would not fail and it did not, and a false negative (FN), in which the lift was predicted not to fail, but it failed.
Here is another hypothetical scenario. A building complex has 100 lifts equipped with IOT remote monitoring devices, all connected to a cloud-based predictive analytics system. This system has predicted that 10 of the 100 lifts will fail in the next two weeks. This also means that the other 90 lifts are predicted not to fail.
A technician is sent to each of the 10 units that are predicted to fail, and finds that eight of them have a condition that, if not repaired, will cause the lift to fail within two weeks. The technician fixes them. During the next two weeks, the other two lifts that were predicted to fail do not fail. However, three lifts that were not predicted to fail do. That means that of the 10 predictions, there were eight true positive predictions and two false positives. Additionally, there were three false negative predictions, and the remaining 87 lifts were true negatives.
One method of visualising these results is a confusion matrix.
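As a rough sketch, the matrix for the 100-lift scenario can be built from the counts given above. The layout and the `format_confusion_matrix` helper are assumptions for illustration. The eight repaired lifts are counted as actual failures because the technician found a condition that would have caused each of them to fail.

```python
# Counts taken from the 100-lift scenario above.
TOTAL_LIFTS = 100
tp = 8  # predicted to fail, fault found and repaired
fp = 2  # predicted to fail, did not fail
fn = 3  # not predicted to fail, failed
tn = TOTAL_LIFTS - tp - fp - fn  # not predicted to fail, did not fail


def format_confusion_matrix(tp, fp, fn, tn):
    """Return a plain-text 2x2 confusion matrix (hypothetical helper)."""
    rows = [
        ("", "Predicted fail", "Predicted OK"),
        ("Actual fail", str(tp), str(fn)),
        ("Actual OK", str(fp), str(tn)),
    ]
    return "\n".join(
        f"{label:<12}{a:>16}{b:>14}" for label, a, b in rows
    )


print(format_confusion_matrix(tp, fp, fn, tn))
```

The true negative count is not stated directly in the scenario; it follows from subtracting the other three cells from the 100 lifts.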
In the real world, these results will be viewed differently by the building complex manager (the customer) than by the lift company’s branch manager. From the perspective of the former, prior to adding the IOT equipment the complex would have experienced 11 breakdowns. With the IOT system it experienced three, because eight breakdowns were prevented. Therefore, the complex manager is satisfied with the improvement.
On the other hand, from the branch manager’s point of view, with or without IOT, 11 lifts would need to be repaired. With IOT, two additional service visits to the lifts that did not fail were required. The branch manager is unhappy because the two visits to lifts that did not fail will reduce the branch profitability.
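The arithmetic behind the two viewpoints can be laid out explicitly. This is a simplified sketch that assumes one technician visit per lift and ignores any cost difference between planned and emergency visits; the variable names are illustrative.

```python
# Counts from the 100-lift scenario.
tp, fp, fn = 8, 2, 3

# Customer's view: breakdowns actually experienced.
breakdowns_without = tp + fn  # with no predictions, every fault becomes a breakdown
breakdowns_with = fn          # only the unpredicted faults become breakdowns
prevented = breakdowns_without - breakdowns_with

# Branch manager's view: technician visits required.
visits_without = tp + fn      # one repair visit per faulty lift
visits_with = tp + fp + fn    # the same repairs plus visits to healthy lifts
extra_visits = visits_with - visits_without

print(breakdowns_without, breakdowns_with, prevented)  # 11 3 8
print(visits_without, visits_with, extra_visits)       # 11 13 2
```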
From this example, we can conclude that the number of false positives is an important metric. The lower the rate of false positives, the happier both the building manager and the branch manager will be.
Prior to installing IOT equipment, all breakdowns were false negatives. Fixing a lift before it fails should come at a lower cost, particularly if it can be repaired during normal working hours. Eliminating false negatives should improve both customer satisfaction and operational efficiency. False negatives can also be labelled unpredicted failures. The reduction of unpredicted failures is a significant metric.
CONCLUSION
In the USA, lifts have been used for 160 years, and there are now one million in operation. In China, half that number are installed every year. In the US, many lift engineers are second- or third-generation engineers. China does not have that heritage. The same is true of India, which has a similar population, similar urban growth and a similar shortage of specialists. How will these countries train all of the people needed to maintain their lifts? Machine learning can help to interpret error codes and provide servicing instructions.
Also, a lot of PM is now actually just inspections by technicians. That kind of work has no maintenance value; you aren’t doing anything to make the lift run better. Automating this process creates a big labour efficiency improvement, because machines can do inspections better than humans. One study found that 45% of a technician’s time could be replaced by machine learning on a system with proper sensors.
This article is based on a presentation to the University of Northampton’s 14th Lift and Escalator Symposium, which took place in September 2023. Rory Smith has worked in the lift industry for 54 years, and a decade ago led the development of the industry’s first AI system, TK Elevator’s Max. He retired in 2019.
BOX: DEVELOPING MACHINE LEARNING ALGORITHMS
Getting an algorithm finished and working, and generating alerts, is not easy. Multinational lift brands have teams of data scientists working with subject domain experts; having representatives from both is essential. Data scientists can’t fix lifts; lift technicians can’t develop machine learning algorithms. Ultimately, they need to work together long enough for cross-training to take place, so that each group begins to understand the other’s subject domain. The author was involved in a project along these lines 10 years ago (to develop the TK Elevator Max system). Based on a single error code or a sequence of error codes, that system indicates the four most probable causes of a given fault.
BOX: OTHER USEFUL METRICS
Precision represents the accuracy of the failure predictions that were made. It is defined by the following formula: Precision = Tp / (Tp + Fp), where Tp represents true positives and Fp represents false positives.
Recall represents the number of failures that were correctly predicted compared to the number of failures that could have been predicted. Recall is defined by the following formula: Recall = Tp / (Tp + Fn), where Tp represents true positives and Fn represents false negatives.
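Applied to the 100-lift scenario in the main text (eight true positives, two false positives, three false negatives), the two formulas can be worked through as follows. The function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = Tp / (Tp + Fp)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Recall = Tp / (Tp + Fn)."""
    return tp / (tp + fn)


p = precision(8, 2)  # 8 of the 10 failure predictions were correct
r = recall(8, 3)     # 8 of the 11 faulty lifts were predicted

print(f"precision = {p:.2f}")  # precision = 0.80
print(f"recall    = {r:.2f}")  # recall    = 0.73
```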
Ideal values of both precision and recall are 1.0. Recall and precision are not metrics of the predictive maintenance system but rather are metrics of the machine learning system that is a part of the predictive maintenance system. If the ML system has recall and precision values of 1.0, but its predictions are ignored and no actions are taken to fix the lift before it fails, then the predictive maintenance system has no value to either the building or the lift maintenance provider. It should also be noted that if the ML algorithm is conservative, predicting a failure only when it is very confident, then precision can be high because there will be few or no false positives. However, recall will be low because the number of false negatives will be high.
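The trade-off can be illustrated with a toy example. Everything here is hypothetical: the risk scores, the outcomes and the idea that the algorithm raises an alert when a score crosses a threshold are assumptions made for the sketch, not a description of any real system.

```python
# (risk score, did the lift actually fail?) -- made-up data for illustration.
scored = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, True), (0.30, False), (0.10, False),
]


def precision_recall(scored, threshold):
    """Precision and recall when an alert is raised at score >= threshold."""
    tp = sum(1 for s, failed in scored if s >= threshold and failed)
    fp = sum(1 for s, failed in scored if s >= threshold and not failed)
    fn = sum(1 for s, failed in scored if s < threshold and failed)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Conservative: alert only on very high scores.
print(precision_recall(scored, 0.85))  # (1.0, 0.4)

# Liberal: alert on moderate scores as well.
p_liberal, r_liberal = precision_recall(scored, 0.35)
print(round(p_liberal, 2), r_liberal)  # 0.83 1.0
```

With the high threshold there are no false positives, but three of the five failures are missed. Lowering the threshold catches every failure at the cost of one unnecessary visit.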