Though it has been practiced for some years, the mining of unstructured data has recently attracted considerable attention. Most stored data is unstructured and contains a great deal of relevant information. Meanwhile, the available structured data is already being exploited, which explains the rising interest in unstructured data.
Most often, what is meant by "unstructured data" is natural language text, but there are other types, such as link data, digital audio recordings, images and video. Each of these represents a very diverse set of potential data sources, such as:
Unstructured data is frequently composed of mixed types – text documents with embedded images or free-form text fields in relational databases, for instance. Separating the various components of unstructured data is sometimes a technically challenging task.
What is termed “unstructured data” actually contains a great deal of structure, but this structure does not conform to the most common data types, which are arranged in regular rows and columns (lists, tables, matrices, etc.), or the typical data manipulations (sorting, summing, indexing, etc.). Even when unstructured data is stored in regular arrays, such as pixels in the rows and columns of a digital photograph, the underlying structure rarely aligns with those dimensions.
Tools that analyze structured data, such as the predictive modeling tools used in data mining, are relatively well developed and have been highly effective. Tools that deal directly with unstructured data are much less mature and have a more mixed track record. Not surprisingly, then, a common approach to dealing with unstructured data is to extract structured information from it in the form of familiar feature vectors, which are then fed to structured analytical tools.
Note that most text mining solutions do not try to get the computer to “understand” the complete meaning of sentences and documents; the computer does not syntactically “read” the text. Often, comparatively simple summaries or data representations are used to good effect in this field.
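As a rough illustration of how simple such a representation can be, the sketch below turns free text into word-count feature vectors (a "bag of words"), one common way of extracting structured features from text. It is a minimal sketch, not a recommended implementation: the function names and the two sample documents are invented for this example, and real projects typically add steps such as stop-word removal and term weighting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Return a sorted list of every distinct token seen in the documents."""
    vocabulary = set()
    for document in documents:
        vocabulary.update(tokenize(document))
    return sorted(vocabulary)


def to_feature_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical sample documents, invented purely for illustration.
documents = [
    "The delivery was late and the box was damaged",
    "Great service, the delivery was fast",
]

vocabulary = build_vocabulary(documents)
vectors = [to_feature_vector(document, vocabulary) for document in documents]

# Each document is now a fixed-length row of numbers, one column per word,
# which is the regular rows-and-columns shape that structured tools expect.
for document, vector in zip(documents, vectors):
    print(vector, "<-", document)
```

Note that the code never parses grammar or meaning. It only counts words, yet the resulting rows can be passed to ordinary predictive modeling tools.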
Organizations already collect a substantial amount of unstructured (or semi-structured) data from customers, partners and suppliers, and yet more is available through the media and the Internet – especially social media. Any of these unstructured data sources might be analyzed for correlation to business metrics of interest. Organizations in many fields are profitably exploiting unstructured data today, often with surprisingly simple tools.
Late in 2015, Harrisburg University of Science and Technology hosted Data Analytics Summit II, an analytics conference with a theme of unstructured data. Speakers came from a mixture of backgrounds and presented information on a variety of types of unstructured data. To the best of my knowledge, neither paper nor electronic copies of the presentation materials are being distributed, but video of the presentations may be of interest and can be found at the following Web links: