
Stepping Through the Data with Multiple Regression Modeling

(Read Part 1, Part 2, Part 3, Part 4, Part 5 and Part 6 of this Stepping Stones series.)

“Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth.” ― Marcus Aurelius

In this “Stepping Stones” series, we covered the basic precepts of Multiple Regression Modelling functions and the concept of how Stepwise Multiple Regression flows from those precepts. Before we move on, here is a tip:

In most regression models it is a given that there will be a parameter for the intercept. That is where the regression line crosses the y-axis. If the regression model is of CPU time, used in capacity planning for CPU growth, then usage is not linear: as CPU utilisation approaches the higher levels, the overhead increases.

Because planning is for growth, it is more critical to have accurate values at the higher levels, so that the capacity threshold is not exceeded and the required service levels can still be met. To accomplish this, it can be advisable to use a model without an intercept term, thereby forcing the intercept to be zero.

In doing this, the regression line generated by the model will be slightly inflated, which can provide more accurate values at the higher levels of utilisation.
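As a minimal sketch of the comparison, assuming NumPy and made-up figures: the workload values, the `cpu_seconds` formula (including the squared term that stands in for growing overhead) and both function names are illustrative assumptions, not measurements or an established method.

```python
import numpy as np

# Hypothetical data: workload units driven vs CPU seconds consumed.
# The squared term stands in for overhead that grows at high utilisation.
workload = np.linspace(10.0, 90.0, 9)            # 10, 20, ..., 90
cpu_seconds = 10.0 + 0.8 * workload + 0.002 * workload**2

def fit_with_intercept(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    design = np.column_stack([np.ones_like(x), x])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coeffs[0]), float(coeffs[1])

def fit_through_origin(x, y):
    """Least squares for y = b1 * x (intercept forced to zero)."""
    return float(np.dot(x, y) / np.dot(x, x))

b0, b1 = fit_with_intercept(workload, cpu_seconds)
slope_origin = fit_through_origin(workload, cpu_seconds)

high_load = 100.0  # a point just beyond the observed range
print("slope with intercept   :", b1)
print("slope through origin   :", slope_origin)
print("forecast with intercept:", b0 + b1 * high_load)
print("forecast through origin:", slope_origin * high_load)
```

With this particular data the fitted intercept is positive, so forcing it to zero steepens the line and raises the forecast at high load. If the fitted intercept were negative, forcing it to zero would flatten the line instead, so it is worth checking the sign of the fitted intercept before dropping it.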

Care should be exercised, and the model output should be reviewed to verify the results. Alternatively, another form of Stepwise Regression Modelling can be employed: the model can be broken into segments for various ranges of CPU utilisation, or of whatever measure is being analysed.
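One way to picture the segmented approach is to fit a separate straight line within each utilisation range. The sketch below assumes NumPy; the breakpoint at 60%, the `overhead` figures and the helper names `fit_segments` and `predict` are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_segments(x, y, breakpoints):
    """Fit an independent straight line inside each range of x."""
    edges = [-np.inf, *breakpoints, np.inf]
    models = []
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (x >= low) & (x < high)
        # polyfit with degree 1 returns (slope, intercept)
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        models.append((low, high, float(slope), float(intercept)))
    return models

def predict(models, value):
    """Use the line belonging to the range that contains `value`."""
    for low, high, slope, intercept in models:
        if low <= value < high:
            return slope * value + intercept
    raise ValueError("value is not covered by any segment")

# Hypothetical data: overhead rises faster once utilisation passes 60%.
utilisation = np.linspace(10.0, 90.0, 9)         # 10, 20, ..., 90 (%)
overhead = np.where(utilisation < 60.0,
                    utilisation,
                    60.0 + 2.0 * (utilisation - 60.0))

segments = fit_segments(utilisation, overhead, breakpoints=[60.0])
for low, high, slope, intercept in segments:
    print(f"range [{low}, {high}): slope={slope:.3f}, intercept={intercept:.3f}")
```

Each segment needs enough observations to support its own fit, so the ranges should be chosen with the available data in mind.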

Up to this part of the “Stepping Stones” series, definitions, descriptions and examples of Stepwise Multiple Regression Modelling have been discussed. In this part, some examples and ideas for insightful and highly beneficial directions will be presented.

As a little food for thought, here are a few potential uses of Multiple Regression Models, especially Stepwise Models. Remember that it might be virtually impossible to generate an adequate solution if you attempt to build a single-stage Multiple Regression Model that is too extended, that is, one with too many variables to solve for (e.g. 20 variables instead of, say, 5). With that many variables, it becomes difficult for the solution to reach a high degree of accuracy and confidence.

With a focus on Big Data, but not limited to it, one might look at models to determine instruction set/mix (or level of usage intensity); user mix; number of CPUs; bus speed; hardware and software vendor; configuration; software release; operating system; and special-case elements (e.g. a new type of technology). The focus should be on the specifics of your data, your instruction set mix, your data mix and your operations with your clients.

Before solid state memory was used, rotating disks and tapes were the primary types of storage. These storage media had rotation and access times that were huge in comparison to CPU processing time. Memory used as main memory had to meet an exceptionally high level of failure avoidance, in the range of 99%+. Rotating disk and tape had acceptable rates that were significantly lower, somewhere around 80% or less. If the memory destined for processors fell below the 99% level, it was binned. Then an incredible idea arose.

What if instead of discarding this memory, it was used to emulate disk or tape devices?

First, it would basically be free; second, it would be more reliable; and third, it would be immensely faster.

This turned into solid state devices, which have taken over a huge portion of all media. Now, as with so many stories, there was the inevitable hitch: methods of measuring the usage of such devices did not exist. For example, one of the first significant devices was known as a 2305 solid state disk drive/drum. The measurement facility was RMF (Resource Measurement Facility), a component of the operating system that measures and controls the hardware resources of mainframe computers. RMF obtains measurements by sampling: it sends a signal to each device and gets a response telling the system whether the device is busy or not. The busy times are used to calculate the device utilisation level.

When RMF sent out the signal to the 2305, the 2305 would always come back with a not-busy response. Why was this? Because the 2305 was so fast (relatively) that the I/O request had finished before RMF could send out its busy/not-busy query. This made it look to the system as if the 2305 was idle, when in fact it might have been at virtually full I/O capacity (80-95% or more).

A situation that arises more often is that of I/O bus speed with respect to CPU processing capability.

CPUs are significantly faster than data transfer via I/O. The problem arises when the bus speed for transferring data is too slow for the CPU, so that the CPU is forced to wait for the I/O to complete before it can process. In a world where multiple or faster CPUs are being added to processors, a system can very quickly reach a maximum benefit that is far below what might otherwise be available. It hits a peak at which adding a faster or additional CPU does not provide any additional capacity.

The reason is simple. The bus limit has been reached, and there can be no additional processing until more data can be pushed through. It is similar to a traffic jam: when a roadway is completely filled with vehicles (the bus is full), no one else can get onto it until some have moved off.
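The ceiling can be illustrated with a deliberately simplified model in which total throughput is the smaller of what the CPUs could process and what the bus can deliver. The function name `effective_throughput`, its parameters and the rates used below are hypothetical; real systems have queuing and caching effects that this sketch ignores.

```python
def effective_throughput(n_cpus, per_cpu_rate, bus_limit):
    """Simplified model: work completed per second is capped by the bus.

    n_cpus       -- number of CPUs installed
    per_cpu_rate -- units of work one CPU could process per second
    bus_limit    -- units of work per second the bus can feed to the CPUs
    """
    cpu_capacity = n_cpus * per_cpu_rate
    return min(cpu_capacity, bus_limit)

# Hypothetical figures: each CPU could handle 100 units/s, the bus 350 units/s.
for cpus in (1, 2, 3, 4, 8):
    print(cpus, "CPUs ->", effective_throughput(cpus, 100, 350), "units/s")
```

Under these assumed rates, capacity grows with each CPU up to the third, the fourth adds only part of its potential, and every CPU after that adds nothing.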

In the next article and conclusion of this series, we will discuss a number of uses for Multiple Regression.

Article written by Dr. Joseph Trevaskis