Data, more often than not, does not offer an obvious explanation of what it contains. Properly used, regression techniques can be a powerful tool for unlocking and revealing details that are obscured within data.
Regression is a way to calculate a line through data points, which can then be used to predict behaviour such as projected sales or resource usage levels. To get the most reliable estimates, it helps to have the value of one of the coordinates. For example, if you know the value of x, the fitted line for a given scenario will show the corresponding y value on a graph.
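To make this concrete, here is a minimal sketch of fitting such a line by ordinary least squares. The monthly sales figures are invented purely for illustration; the point is how a known x (the month) yields a predicted y (sales).

```python
# A minimal sketch of simple linear regression (ordinary least squares).
# The data below is hypothetical, used only to illustrate the idea.

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Months 1-5 with made-up sales figures that trend upward.
months = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]

slope, intercept = fit_line(months, sales)
projected = slope * 6 + intercept  # predict sales for month 6
```

With perfectly linear data like this the fit is exact; real data scatters around the line, and the line summarises the underlying trend.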
A finely tuned athlete, or perhaps an expert in a martial art, can develop to the point of regulating their breathing and even their heart rate. Yet even at the highest level, they are still not in complete mastery of their mind and body. Some part remains less than perfect, or else no further practice and competition would be necessary. Computer and other systems are no different.
In reality, it is virtually impossible for a system to monitor itself with complete accuracy. In fact, with business applications, the more a system monitors itself, the less it is available to devote to the task it was designed for.
The consequence is that approximation and sampling are used to derive a close approximation of the true value. The more interactive the system or application, the less precise the recording will be of which process is using which resources. This is a matter of functionality and purpose.
Take a system designed for transactions, especially high-volume, high-speed transactions such as a trading floor, a customer service desk or fast-moving consumer goods. Typically, such systems are designed for high-speed handling of queries rather than for recording the resources used to perform them.
Regression models can be the technique that unlocks what is hidden within the raw data when you have only some of the metrics. Previous articles have discussed how quantitative methods can yield misleading results.
So how can we ensure that certain quantitative techniques are helpful when others are not accurate?
The answer lies in how they are used. Put simply, if the algorithm or equation is properly designed and the approach is thought through correctly, then the results will be reliable.
In the situation described above, where the recorded data contains only partial information, regression models can prove most powerful, and the solutions can be the key to accurate measurement, billing and planning.
Consider the example of a series of batch jobs being processed on a mainframe computer. While a job is processing, much of the usage (which process is using which resources) can be clearly identified and measured. We can see file allocations, input and output (I/O) activity, memory usage and the majority of the CPU time, and know fairly accurately which job used them. But not everything is precisely measurable. For one thing, operating system overhead is not fully allocated. The mainframe operating system does record the total CPU time consumed, as well as the time used by various components. The problem is that the sum of the parts does not equal the total.
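One way regression can bridge that gap is to regress the recorded total CPU time on the CPU time directly attributed to jobs, interval by interval. The sketch below uses invented per-interval figures purely for illustration: the slope above 1 estimates overhead that grows with job activity, and the intercept estimates a fixed per-interval overhead.

```python
# A hedged sketch: estimating unallocated operating-system overhead by
# regressing total CPU time on the CPU time attributed to jobs.
# All figures are hypothetical, invented only to illustrate the method.

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

# Per-interval CPU seconds: time attributed to jobs vs. total reported.
attributed = [50, 80, 110, 140]
total      = [60, 93, 126, 159]

slope, intercept = fit_line(attributed, total)
# (slope - 1) estimates overhead proportional to job activity;
# intercept estimates fixed overhead per interval.
```

The fitted coefficients can then be used to apportion the otherwise unaccounted-for CPU time back to the jobs that drove it, which is the kind of technique the next article takes up.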
To answer the questions of actual usage, including that unallocated activity, regression techniques can be used. The next article will explore the use of regression.
Q: How many economists does it take to change a light bulb?
A: Seven, plus/minus ten.
Check out Regression Models - Data Analysis Friend or Foe - Part 2
Article written by Dr. Joseph Trevaskis