Big data: The needle or the haystack
“Finding the needle in the haystack you didn’t know you were looking for”
I was reminded of this line when I announced that I have the privilege of speaking at EuroSTAR 2014 on using Big Data software (to be precise: operational analytics software) for testing. In response, someone stated, rather matter-of-factly, that this can’t be useful in a testing environment, especially for new functions, since in test you don’t have a lot of data…
Big Data is just a lot of little data
When we talk about Big Data we obviously think of large amounts of actual data. The data of one movie on a Blu-ray disc easily exceeds 30 GB. That’s a lot of data, and as a result even home storage systems run to multiple terabytes of storage space. Strictly speaking, however, it is just a few files on such a disc. The media files do contain a lot of data, but typically not the data we are interested in from a Big Data point of view. In fact, the data from (or on) such a disc that would be indexed and made available in Big Data would be limited, since the bulk, the data which describes how to render the picture and the sound, is not interesting for indexing.
Typically, the Big Data we talk about is really just a big collection of small data. This can be data from all sorts of systems. What is big about it is that, with current technology and storage options, we can and do collect data from so many sources, many of which are capable of recording or logging so much data, that the total gets to be Big.
What kind of data do we collect then?
Wikipedia describes Big Data as:
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
The Wikipedia description is broad: it can be any type of data. These days an enormous amount of data is generated in expected and unexpected places. All sorts of devices are connected to a network, generating data as logs. Obvious examples are network devices, but if you look carefully there is much more. For example, the infra-red eyes in shops count the customers entering and leaving, elevators log their movements, candy and coffee machines provide stock information, copiers and printers do the same, and we should not forget the amount of logging data in the larger systems of corporations, such as service buses, customer relationship systems, etc.
The common denominator is that this data was often originally intended just for debugging or deep-dive analysis, and is often purged quickly due to storage limitations. But since these devices are connected, we can collect this data in a central place before it is removed from the original location. Not only can we keep it longer, we can also correlate data from different sources, index it to make finding things easier, and use software to visualize trends so we can use the data more effectively.
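To make that idea concrete, here is a minimal sketch in Python. The log line format, file locations, and field names are all assumptions for illustration; real operational analytics tools do this parsing and indexing for you, at far greater scale.

```python
import glob
import re
from datetime import datetime

# Assumed log line format: "2014-06-01 12:00:03 ERROR Payment gateway timeout"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")

def collect_events(paths):
    """Gather log lines from many sources into one time-ordered event stream."""
    events = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                match = LINE_RE.match(line.rstrip("\n"))
                if match:
                    timestamp, level, message = match.groups()
                    events.append({
                        "time": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
                        "level": level,
                        "message": message,
                        "source": path,  # keep the origin so events stay traceable
                    })
    # One chronological stream across all systems instead of one file per system
    return sorted(events, key=lambda event: event["time"])

# Hypothetical location: point it at whatever logs you have collected centrally
events = collect_events(glob.glob("logs/*.log"))
```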
There is software that does just that: it collects data from every source you tell it to, indexes it, and makes it available for analysis. Yes, this software was created from the perspective of making it possible to do something with all that (Big) data. But does that mean the software needs a lot of data to be useful?
Using Big Data
So what could all those log files be used for? There are several ways in which we are now able to make use of all the data that is being gathered. They can be divided into two things:
- The big picture: properly indexed data allows us to see trends, anomalies, etc. This is often used for marketing purposes, but also for operational analytics to detect sudden spikes in behaviour or errors. For this purpose visualization functionality is highly important.
- The needle in the haystack: with all the data indexed, the search engine is crucial for finding that one error that is at the root of a bigger issue.
Both are about analysis of the data collected in our Big Data system. This can be business analytics or operational analytics: business analytics focuses on things like sales and usage trends, while operational analytics focuses on improving things like application events and system performance.
In operations, both the big picture and the needle in the haystack are important for improving your operations. It is very helpful to look for trends; if you see a peak of error messages, you know this may require investigation. And since you have the data of the past in there as well, you can find out whether such peaks happen more often and whether there is a trend. But it’s not just errors or time-outs that are interesting; trends in general behaviour are interesting too. For example, if you see spikes in usage at regular intervals, you can tune your systems to account for that.
To find the needle in the haystack you actively have to look for it. The same goes for finding the cause of an issue. Whether or not you were made aware of an issue by analytics, finding the real cause is often a time-consuming process that depends highly on talent and experience, as well as on painstakingly digging through all the systems that may be involved to find the one fault causing the bigger issue.
Operational analytics software (based on big data) can be a powerful tool in this search because:
- All the data is in one place: no logging into each system separately and digging through countless files.
- The data is turned into time-based events, allowing us to look for issues in all systems in a certain time frame.
- Correlation: Seeing the relation between errors in the different systems is extremely useful.
- We can visualize what is happening, which often triggers the insight needed to understand it.
- With powerful search and filter options we can search through the data from all the systems and see results in seconds or even milliseconds.
- History: with operational analytics software you can see in seconds how often similar errors occurred in the past, what normal behaviour is, and even whether those errors correlated with issues similar to the one you are currently dealing with (see the sketch below).
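As a rough illustration of the time-frame and trend points above, the following sketch builds on the `events` list from the earlier example. The five-minute window and the ERROR level are arbitrary choices for illustration; a real analytics tool gives you this through its search and visualization interface.

```python
from collections import Counter
from datetime import datetime, timedelta

def events_in_window(events, start, minutes=5):
    """Select the events from all systems within one time frame."""
    end = start + timedelta(minutes=minutes)
    return [e for e in events if start <= e["time"] < end]

def error_counts(events):
    """Count errors per source; a sudden spike in one system stands out."""
    return Counter(e["source"] for e in events if e["level"] == "ERROR")

# Look at one time frame across every system at once
window = events_in_window(events, datetime(2014, 6, 1, 12, 0))
print(error_counts(window).most_common())
```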
Without the haystack
Let’s get back to the original question of whether this kind of software can be used without a huge amount of data. When you test a single application which has no interaction with other systems, the log files themselves are quite useful and usable with just a plain text editor. But nowadays we hardly ever find ourselves in such a situation. Most of the software we deal with will communicate with other systems, which in turn communicate with other systems. The interaction is also where the challenge is: as soon as developed software leaves the system test stage and moves to test phases where it is integrated, the issues arise.
The integration is a challenge for all those involved: designers, developers, testers, and even the project managers. The complexity grows; you have to deal with different systems, teams, designs, and development philosophies. It becomes harder to grasp and fully gauge the process. And this is where we see the issues that are harder and slower to fix, where we see delays, and where overlooked issues end up as problems after go-live. The bottom line is: the individual systems will work as designed; the integrated system is much harder to get working as desired.
We try to tackle this in many ways. Interface agreements are made. Interface testing is done. But most of all: we test the full life cycle of business processes. For instance, if we place an order as a test, we will follow the order through the entire chain of applications to see if the order is processed and registered, if the logistics system receives, stores, and processes the order, if the billing system does the same, and we will check if the bill is made correctly. In this simple example we already have multiple systems to check. That means logging in to these systems, often directly into their databases, and performing all sorts of checks. And then, if you are really thorough, checking the log files of the systems to see if any error was thrown. Mostly, however, we only check the log files if we have already concluded that something was wrong, i.e. if an order did not end up in one of these systems.
Now imagine that you can track the entire process through the systems in one view. Where not only the events in the log files are gathered, but also the tables in the databases that store the orders. And where data is also gathered from systems that you didn’t realise were receiving updates from this order. Usually there is a unique business transaction ID used in all the systems, which makes it really easy to get all the events for that ID in one overview. That means that the majority of the checks you were to perform are now in front of you on one screen. That makes your life easier. But it will also show you anything that you hadn’t anticipated for that process. You get to see errors thrown that you weren’t looking for, such as errors thrown in that loyalty system you were unaware of, or an error alerting you to the fact that the order was registered but, due to an error, ended up in a state you were unaware of. And you would also see errors in the log files that you might only look for if a more obvious issue was noticed.
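In code terms, tracing by transaction ID is little more than a filter over the combined event stream. The sketch below reuses the `events` list from the first example; the ID format is made up for illustration.

```python
def trace_transaction(events, transaction_id):
    """One chronological view of what every system logged for one order."""
    trail = [e for e in events if transaction_id in e["message"]]
    for e in trail:
        # Errors from systems you weren't even checking show up here too
        print(f"{e['time']}  {e['source']:<20}  {e['level']:<6}  {e['message']}")
    return trail

# Hypothetical business transaction ID used across all systems
trace_transaction(events, "ORD-2014-001234")
```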
The benefits in the example used would be:
- The anticipated checks are much less time-consuming.
- Checking all the systems is more accurate, as the manual steps of checking each system by hand are error-prone.
- Anomalies are visible, such as errors in connected systems that you didn’t realize were impacted, or errors in the systems you were checking but in a different part of them.
The first two are interesting benefits, but the real power is in the last one. Those issues are the same ones that often lead to comments after go-live such as: “How could test have missed this? The logs are full of these errors.”
It is very easy to become biased towards expected results, as we naturally check for what we expect. We have, after all, been trained that we can’t test or check everything and have to test smartly. Yet as most testers know, the most interesting results come when you notice something weird, something unexpected.
With just a little data, hardly Big Data, software that may have been meant and designed for Big Data is equally useful. In this scenario it is still possible to do the same by hand. We could open all the log files, search for the unique transaction ID, and see if any errors were thrown. We could also do that for systems we considered out of scope. But not only would it be much more time-consuming and error-prone, chances are you would not be able to go as deep. Much more importantly, we don’t get the time to do that.
So do we need a haystack?
We do need data, but this may be just a little data. The data generated by a simple test readily becomes more than enough. What really matters is the complexity. If you have one log file and one system, the benefits are going to be very low. As soon as you have more log files, or the system under test gets more complex (as with service-oriented systems), this software gets to be really useful!
Searching through the haystack was the reason for this software to be created, but we don’t need the haystack. We need the needle, the haystack is optional.
Biography
Albert Witteveen has been working both as an operations manager and a professional tester for nearly two decades now. The combination of setting up complex server environments and professional testing almost automatically led to a specialisation in performance testing.
He wrote a practical guide to load and stress testing, which is available at Amazon. The book discusses how to do performance testing, how to provide real value, and how to assess performance in an objective way. It describes how to perform the tests, what and how to monitor, how to design the tests, how to set up the team, and how to report.
In his current assignment as a test manager he employs the lessons learned in operations to make testing more efficient as well as more effective.
Albert will also be presenting at the 2014 EuroSTAR Conference on Introducing Operational Intelligence into Testing.