Elsewhere, I have suggested that there are three junctures at which a data mining project is most likely to go wrong: 1. problem definition, 2. data acquisition and 3. model validation (see the Data Mining and Predictive Analytics Web log). Data acquisition is a superset of statistical sampling, and the text by Lohr is highly recommended on this topic. Model validation is well explained in the literature: see, for instance, Weiss and Kulikowski.
Problem definition involves understanding the business problem and mapping an appropriate technical solution to it. This is not as simple as it sounds, and it is easy to be naïve about how best to construct a technical solution that naturally fits the given problem.
The basic modeling process is to develop a model function which maps input data to output data. It may be tempting to apply this template to problems in the simplest manner possible, but this may not be optimal. Two cases are detailed here to illustrate this point.
The first example is taken from a technical paper by García-Pérez et al. This paper describes a fuzzy logic solution for an agricultural image segmentation problem. The problem is to decide at what level of brightness to threshold an image of crops. The image is taken from a digital camera mounted on farm equipment, and the threshold is used to segment bright parts of the image (plants) from dark parts of the image (background, mostly soil).
The threshold is a numeric parameter which separates light tones from dark ones. Every pixel in the image falls into one segment or the other (light or dark), and the threshold which divides them is adjustable. The challenge is that if too many things are included as bright (again, plants), then distracting items (stones, etc.) will accidentally be classified as plants, whereas if too many things are included as dark (background), then smaller plants will be misidentified as soil. Simply put, the technical situation is that a grayscale image is available as input, and the required output of our solution is an optimal level of thresholding. We won’t delve into the details of fuzzy logic here, but for our purposes the reader may consider the fuzzy logic engine as a model.
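The mechanics of thresholding itself are simple; a minimal sketch follows. This is only the segmentation step, not the authors' fuzzy logic engine, and the toy image values here are invented for illustration:

```python
import numpy as np

def segment(image, threshold):
    """Split a grayscale image into 'plant' (bright) and 'background'
    (dark) pixels at the given threshold.

    image: 2-D array of brightness values (0-255).
    Returns a boolean mask: True where the pixel is classified as plant.
    """
    return image >= threshold

# A toy 3x3 "image": a few bright plant pixels against dark soil.
image = np.array([[ 50,  60, 210],
                  [ 55, 220, 215],
                  [ 48,  52,  58]])

mask = segment(image, threshold=128)
print(mask.sum())  # number of pixels classified as plant -> 3
```

Every pixel lands on one side of the cut or the other, which is why the entire quality of the segmentation rides on choosing the threshold well.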
The naïve approach is to attempt to calculate the optimal threshold directly from the given image. The manual equivalent of this solution would be for a person to view the image once, then turn the threshold knob to the best setting, in one shot. This solution is simple to describe, but technically difficult to accomplish.
The authors of the technical paper describe instead an iterative solution: The threshold is set to some initial value, and during each iteration the fuzzy logic engine indicates which direction and by how much to change the threshold, based on the segmented image.
The nearest manual equivalent would be for a person to continuously view the segmented image while turning the knob back and forth, trying to improve the resulting image: It should be plain that this is a much easier task, even if it takes a little longer. All that is needed is to model the suggested changes rather than the final answer and to provide a simple looping structure to repeat this process until it stabilizes (when the suggested change is zero).
A manufacturing client approached me to construct a neural network model to recommend lubricants for metal bearings. Different conditions (high moisture, high contaminant, food industry use, extremely low temperature use, etc.) indicate the need for different lubricants. My client framed the technical problem thus: the neural network would accept as input a collection of application requirements and generate as output the optimal lubricant recommendation (chosen from a fixed list).
As in the previous case, the naïve approach is easy to describe, but challenging to achieve.
I proposed a more flexible solution: the neural network would still accept as input the various application requirements, but it would output an “applicability” score for a single lubricant in the client’s catalog. Theoretically, a lubricant which was a complete mismatch for the application would score 0, while a perfect match would score 100. Intermediate values would indicate a graded scale of applicability to the given problem. This neural network would be executed once for each candidate lubricant.
This framing exposed information that the original one would have hidden. First, all of the available lubricants might score low, indicating that none of them was really a good match for the situation. Simply selecting the best of this bad bunch wouldn’t be helpful to the organization.
Second, several different lubricants might achieve similar, high levels of applicability, but one might be less expensive than the others, despite being only slightly worse as a match to the application. Blindly selecting the most applicable lubricant in this case would be uneconomical.
Finally, even if we had only wanted the best matching lubricant, having the neural network model assess the match of every lubricant to the problem and choose the most applicable among them is inefficient: Neural networks are a natural fit for many problems, but sorting data isn’t one of them. It would be much simpler to have the neural network handle the individual lubricant evaluations and have a simple selection routine choose the lubricant exhibiting the maximum score.
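That division of labor can be sketched in a few lines. The scoring model here is a hypothetical stand-in for the trained network, and the names (`recommend`, `score_model`, the `min_score` and `tolerance` cutoffs, the toy catalog) are all my own illustrative assumptions, not the client's system:

```python
def recommend(requirements, lubricants, score_model,
              min_score=60, tolerance=5):
    """Score each candidate with the model, then apply simple
    business logic outside the network.

    score_model(requirements, lubricant) stands in for the trained
    neural network and returns an applicability score from 0 to 100.
    Returns None when nothing clears min_score, and prefers the
    cheapest lubricant within `tolerance` points of the best score.
    """
    scored = [(score_model(requirements, lub), lub) for lub in lubricants]
    best_score = max(s for s, _ in scored)
    if best_score < min_score:
        return None  # the whole catalog is a bad match; say so
    # Among near-best candidates, pick the cheapest.
    contenders = [lub for s, lub in scored if s >= best_score - tolerance]
    return min(contenders, key=lambda lub: lub["cost"])

catalog = [
    {"name": "A", "cost": 12.0},
    {"name": "B", "cost": 7.5},
    {"name": "C", "cost": 20.0},
]
fake_scores = {"A": 92, "B": 89, "C": 40}  # invented for illustration
pick = recommend({}, catalog, lambda req, lub: fake_scores[lub["name"]])
print(pick["name"])  # B: nearly as applicable as A, but cheaper
```

The neural network does only what it is good at, evaluating one match at a time, while a trivial selection routine handles the sorting, the cost trade-off and the "none of the above" case.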
In the data mining world, it has been popular for several decades to chase the latest modeling algorithm. More recently, it has become fashionable to rely exclusively on whatever software is currently in vogue.
Notice that neither of these, however sophisticated, will salvage a data mining project that has gotten the problem definition wrong. Curiously, problem definition has received little attention in the literature, despite being the bedrock upon which all other data mining decisions rest.