Operational intelligence: a car analogy
- Posted by Albert
I often get questions when I speak about using operational intelligence software for testing that make me realize the concept is not that easy to grasp. The answer is obvious: we need a car analogy! If you can’t explain a technical thing with a car analogy, it probably isn’t true.
So here is a little made-up story with a car analogy to explain what operational intelligence is:
John was looking at one of his screens. His desk was filled with two large flat screens. One of them was right in front of him with the other a bit to the right. On the right he had the necessary evils such as the corporate email and some terminals ready to drag to his main screen if needed.
His main screen showed mainly graphs and charts in a web browser. For a few years now this had been what he looked at during the day. If he didn’t have to answer emails, that is.
John was working for a car manufacturer. He was proud to work for this company. It had disrupted the market by building cars with innovations that the other brands would not have considered possible. And he knew his work was important for the name the company had built up for itself.
One of the things they had introduced three years ago was that each car was always connected to the internet and sending a constant stream of logging data to their servers. Before that the cars had always logged the same data in the on-board computer, but that was only used when the cars were being serviced. Now all sorts of data from all the cars they had produced in the last few years was coming in in real time. And that was what he was looking at.
John was hoping a little bit for an issue. He still had to fill in some corporate survey and would love an excuse to get away from that.
At first it looked as if he was out of luck, until he saw the graphs on the oil levels of the fleet. There was a negative trend there. For a few weeks the number of cars that reported drops in oil level had been on the rise.
John clicked on the graph to see what he could make of it. Until a few weeks ago there was a normal baseline. There would always be a few cars with issues in the fleet. But for a few weeks now more and more cars had reported dropping oil levels. Of course the owners had been automatically notified, but what was going on?
He quickly opened the query behind the graph and changed it a bit to show him a pie chart of the different models in the period of the increase. There was the first clue. Nearly all the cars with issues were of a certain type. Let’s check the mileage on these cars. Hmm, that didn’t help much. There were cars with widely different mileages.
John thought about it for a bit. What happened a few weeks ago? He produced another query. Fortunately his regex skills were not deserting him as he cross-referenced the cars with issues against the data from the service database. A pie chart appeared on his screen showing him which of the problem cars had been serviced in the last few weeks. Nearly all of them had been. But did all the cars that had been serviced have this problem? Another query quickly showed that it was still only a percentage of the total of serviced cars that showed these issues.
John knew that there was also data being indexed from a sensor that measured vibration and shocks. So he brought up the information from one of the impacted cars. He first sought out when the sensors started showing a dropping oil level. Then for that timeframe he brought up the vibration data. He widened the timeframe a bit to see what was normal. And sure enough, just before the drop in oil levels the vibration was significantly higher than the baseline. He did the same for a few more examples, and he checked whether the cars that had no problems had experienced high vibrations since their last service. The figures confirmed his suspicion: nearly all the cars that had been serviced in the last few weeks and had then had strong vibrations, probably due to a bad road, would get the problem with the dropping oil level.
John wrote an email with the graphs and figures of his findings and sent it to the group responsible for maintenance. And with a deep sigh he started the nasty survey after all.
Moments after finishing his survey he received a reply. The maintenance group had found out, helped by all the information he provided, what the root cause was. A few weeks ago they had changed the regular servicing to include the replacement of a component in the engine. The new component was fine, but they had neglected to update the installation manual. If you followed the old procedure, you would miss the one step that would really tighten the component. So if you then went over a patch of bad road, the component would loosen up and start leaking oil.
John smiled. Before they had this system it would have taken months before it would even have been clear that there was a problem, and then possibly weeks before they could have narrowed it down. Now within hours they had found the issue, determined the root cause and were able to prove, with nice graphs that management understood, that a recall was in order.
His next task was to help the test department. They had a new prototype out that was being tested by this James guy. He was a bit different from what John expected from testers. Apparently he spent a lot of his time preparing for the tests of the new prototype by reading back issues of Top Gear magazine. When John asked about it, James just mumbled about context.
The new car was also streaming all the data from the on-board systems to their operational intelligence system. They had even installed extra sensors and were receiving that data as well.
Now this James guy was taking the car for a drive. Apparently he was trying all sorts of things: drive fast, drive slow, accelerate, all the things you would expect. John was watching the information on his screen as James was putting the car through its paces. Then one of the temperature gauges on his screen started rising quickly. John notified the test leader, who signaled James to abort the test. They sent a mechanic over to check it out. In the meantime James, who was waiting by the car, called him. “Hey John, when I go over 90 km/h the car gets really noisy. For the rest it’s a really quiet car.”
Good, that gave John something to sink his teeth into. He pulled up the speed information. The software automatically put it on a time scale. Then he combined it with the information from the decibel meter that was installed in the car. The total sound volume had stayed below the required threshold, but indeed, when the car was going over 90, it would increase significantly. He zoomed in on a period covering the moments just before James had reached 90 and some time while he had been going over 90. He looked at other graphs of that period. As they had more sensors installed especially for the test, he also had vibration information on many of the components. He looked for anomalies. Most components that showed vibration had a pretty even and low increase, except for the dashboard. At roughly 90, its vibration shot up. He called James back. “Hi, I think it’s the dashboard, see if the mechanic can do something about that.” “Sure, I think he is done anyway. Apparently something was blocking the cooling system. Glad you noticed. If you hadn’t, it might have melted that part, which would have set us back weeks.”
“Weeks?” John asked.
“Yeah, replacing the part is not the problem. But if it had melted we would never have been able to tell it was just a small assembly mistake. Would have taken them weeks to either deduce that or at least allow us to continue.”
Ten minutes later the car was back in action. James took it quickly over 90. This time the noise didn’t rise so quickly. John spotted the mechanic walking back into the building. “Hi, what did you do to fix the noise thing?” he asked. “Couldn’t do much, but I pressed some rubber in between the panels.” “Okay, at least we now know which panel we need to attach a bit better.”
So what is operational intelligence?
Wikipedia describes operational intelligence (OI) as:
Real-time dynamic, business analytics that delivers visibility and insight into data, streaming events and business operations.
When OI is described, people often mention that the software or system will stream data in (near) real time to the OI system and that this data was always there, such as log data. The OI system simply taps into it: things like log files, sales order tables, etc. As soon as an event is logged into a file, it is streamed to the OI system, which turns it into an indexed event. As in the example, modern cars have on-board computers which log information. Sometimes the logging information is used when the on-board computer is connected while the car is being serviced. Usually nothing really happens with it. There are already cars that are always connected to the internet and sending health information back to the manufacturer. If their logging information is also streamed to an OI system, the scenarios above could happen.
For IT systems it is a bit easier. These days those are usually already connected. And they continuously log information, which is indeed usually also ignored. If it is used at all, it is used to troubleshoot failures. But troubleshooting without OI is a painstaking and slow process where the engineer has to log in to many systems, find and interpret the bits of information him or herself, and combine them by hand.
If you know exactly which system, or even which part of the system, is failing, it is not so hard, but it does involve logging in to that system, opening the files and searching for the exact information.
However, if you only know that ‘something is wrong’ in your systems, you end up opening many files, looking for clues and investigating further.
Operational intelligence software is software that continuously and in near real time gathers the information as it is being logged, indexes it and stores it in a central place.
By itself that is not such a big deal. We’ve had centralized logging for a while. What sets it apart is that, as part of the indexing, every item being logged is turned into ‘an event’. And these indexed events are now searchable. We can now actually ‘Google’ our systems to find the issue.
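To make the idea of ‘turning a log line into a searchable event’ a bit more concrete, here is a minimal sketch in Python. It is not how any particular OI product works. The log format (timestamp, car id, then key=value pairs), the class name EventIndex and all field names are assumptions made up for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical log format, assumed for illustration only:
#   "<timestamp> <car_id> <key>=<value> <key>=<value> ..."
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<car>\S+)\s+(?P<fields>.*)$")


def parse_event(line):
    """Turn one raw log line into an event (a dict), or None if it does not parse."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    event = {"ts": match.group("ts"), "car": match.group("car")}
    for pair in match.group("fields").split():
        key, sep, value = pair.partition("=")
        if sep:  # skip tokens that are not key=value
            event[key] = value
    return event


class EventIndex:
    """A toy in-memory index: every (field, value) pair points to matching events."""

    def __init__(self):
        self.events = []
        self.by_field = defaultdict(set)  # (key, value) -> positions in self.events

    def add(self, line):
        event = parse_event(line)
        if event is None:
            return  # unparseable lines are dropped in this sketch
        position = len(self.events)
        self.events.append(event)
        for key, value in event.items():
            self.by_field[(key, value)].add(position)

    def search(self, **criteria):
        """Return the events that match all given field=value criteria."""
        if not criteria:
            return list(self.events)
        matches = [self.by_field.get(item, set()) for item in criteria.items()]
        hits = set.intersection(*matches)
        return [self.events[i] for i in sorted(hits)]
```

With something like this, a question such as ‘which events report a low oil level for model X’ becomes a call like index.search(model="X", oil="low") instead of opening files by hand. A real OI system does the same kind of thing at a much larger scale and on a continuous stream.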
It offers much more. You can not only find issues, but you can also see in graphs how many errors and other events are logged, when they started, what the historical values are, etc. You can also look for other things, such as successful events. Many questions that in the past were hard to answer, or took time to answer because you had to wait until the data ended up in your BI systems, can now be answered in seconds.
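As a rough illustration of ‘when did it start and how does it compare to the historical values’, the sketch below counts events per day and flags the days that are well above a baseline. The event layout (a dict with an ISO timestamp under "ts"), the function names and the factor of 2 are all assumptions chosen for this example, not something an OI product prescribes.

```python
from collections import Counter
from statistics import mean


def daily_counts(events):
    """Count events per day, assuming each event has an ISO timestamp in 'ts'."""
    return Counter(event["ts"][:10] for event in events)


def days_above_baseline(counts, baseline_days, factor=2.0):
    """Return the days (outside the baseline period) whose count exceeds
    `factor` times the average daily count of the baseline period.

    `baseline_days` must not be empty.
    """
    baseline = mean([counts.get(day, 0) for day in baseline_days])
    return sorted(
        day
        for day, count in counts.items()
        if day not in baseline_days and count > factor * baseline
    )
```

The graphs in an OI tool answer the same question visually; the point is only that once the events are indexed, such a question is a small query rather than a project.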
Take the example above. In theory John could have requested all the log files. He would then have had to open them all and be a genius to notice a rise in the number of cars with dropping oil levels. The step of narrowing it down to one particular model would have taken some serious Excel skills combined with a lot of time to gather the information. Similarly, the steps to combine this with other data, such as when cars were last serviced, and showing that in a timeline would have been a daunting task. With OI it is possible to get that insight and overview in minutes. The software itself produces it in seconds.
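The two steps John took, breaking the problem cars down by model and cross-referencing them with the service data, can be sketched as follows. This is only an illustration of the reasoning: the field names, the function names, the four-week window and the shape of the service data (a mapping from car id to last service date) are assumptions invented for this example.

```python
from collections import Counter
from datetime import date, timedelta


def share_by_model(events):
    """Fraction of events per car model: the numbers behind John's first pie chart."""
    counts = Counter(event["model"] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: count / total for model, count in counts.items()}


def recently_serviced(car_ids, last_service, today, weeks=4):
    """Cars from `car_ids` whose last service falls within the past `weeks` weeks.

    `last_service` maps a car id to the date of its last service (hypothetical
    stand-in for the service database in the story).
    """
    cutoff = today - timedelta(weeks=weeks)
    return {
        car
        for car in car_ids
        if car in last_service and last_service[car] >= cutoff
    }
```

For example, with a handful of made-up oil-drop events, share_by_model shows which model dominates, and recently_serviced shows how many of those cars were in the garage shortly before. In an OI tool each of these is one query on data that is already indexed.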
And what about testing, wasn’t this meant for operations teams? When someone in operations is proactively looking for issues, finding their root causes and solving them, they do pretty much what many testers do. This is especially true for exploratory testing (ET). ET relies very much on feedback from the system under test to define the next test. OI gives you much more feedback than just the GUI. There are strong parallels in our activities. And I guess few would disagree that it is better to proactively look for issues and find them in test than after we are already in production.
I hope the car analogy will help people understand better what operational intelligence is and how it can help testing. It is a completely different kind of addition to our tool set.