Though it has been practiced for some years, the mining of unstructured data has recently attracted quite a bit of attention. Most stored data is unstructured and contains a great deal of relevant information. Meanwhile, the available structured data is already being exploited; hence the rising interest in unstructured data.
Most often, what is meant by “unstructured data” is natural language text, but there are other types, such as link data, digital audio recordings, images and video. Each of these represents a very diverse set of potential data sources, such as:
- Natural language text:
  - internal company emails
  - business news feeds
  - customer complaints
  - annual shareholder reports
  - computer logs
  - social media status updates
- Link data:
  - cellphone customers who have called each other
  - products which have been purchased together
  - legislators who have voted similarly
  - medical patients who have come in contact with each other
- Digital audio recordings:
  - customer service calls
  - military audio surveillance (for passing vehicles, etc.)
  - music files
  - digitized recordings of engine noise
- Images:
  - agricultural inspection images
  - x-rays of airport luggage
  - satellite weather images
  - drone photographs of wildlife
  - medical CT scans
- Video:
  - security camera footage
  - traffic monitoring video
Unstructured data is frequently composed of mixed types: text documents with embedded images, for instance, or free-form text fields in relational databases. Separating the various components of unstructured data is sometimes a technically challenging task.
What is termed “unstructured data” actually contains a great deal of structure, but this structure does not conform to the most common data types, which are arranged in regular rows and columns (lists, tables, matrices, etc.), or the typical data manipulations (sorting, summing, indexing, etc.). Even when unstructured data is stored in regular arrays, such as pixels in the rows and columns of a digital photograph, the underlying structure rarely aligns with those dimensions.
Tools that analyze structured data, such as the predictive modeling tools used in data mining, are relatively well developed and have been highly effective. Tools that deal directly with unstructured data are much less well developed and have a more mixed track record. Not surprisingly, a common approach to dealing with unstructured data is to extract structured information in the form of familiar feature vectors, which are then fed to structured analytical tools.
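As a rough illustration of that approach applied to link data, the sketch below turns a list of co-purchase pairs into one row of numeric features per product. The product names, the `link_features` helper and the two features chosen (how many other products an item is linked to, and how many pairs of those linked products are also linked to each other) are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical link data: pairs of products bought together (made-up names).
co_purchases = [
    ("tent", "sleeping_bag"),
    ("tent", "lantern"),
    ("sleeping_bag", "lantern"),
    ("lantern", "batteries"),
]


def link_features(pairs):
    """Turn an edge list into one fixed-length feature row per item."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    rows = {}
    for item, linked in neighbors.items():
        # Count pairs of this item's neighbors that are themselves linked.
        triangles = sum(
            1 for x in linked for y in linked if x < y and y in neighbors[x]
        )
        rows[item] = {"degree": len(linked), "triangles": triangles}
    return rows


features = link_features(co_purchases)
for item in sorted(features):
    print(item, features[item])
```

Each product now has a regular row of numbers, which is the form that ordinary structured tools expect.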
Text mining in particular very often uses this strategy, broadly following these steps:
- Acquire text data from the source
- Convert to a common format (for example, HTML, Word documents and PDFs to plain ASCII text)
- Delete or re-direct extraneous material (embedded tables, charts, pictures, etc.)
- Eliminate noise words (“of,” “a,” “the,” etc.)
- Reduce words to their stems: “lending” becomes “lend,” “defaulted” becomes “default,” etc.
- Consolidate synonyms
- Extract features – often these are simple statistical summaries, for example counts or percentages of terms drawn from special lists (such as “positive” or “negative” words for sentiment analysis)
- Proceed as usual using structured data analysis tools
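The middle steps of this list (noise words, stemming, synonyms, feature extraction) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the word lists are tiny made-up placeholders rather than real lexicons, `crude_stem` is a hypothetical stand-in for a proper stemming algorithm, and the text is assumed to have already been acquired and converted to plain text.

```python
import re
from collections import Counter

# All word lists below are tiny, made-up placeholders, not real lexicons.
STOP_WORDS = {"of", "a", "the", "and", "on", "to", "was", "is", "my"}
SYNONYMS = {"terrible": "bad", "awful": "bad", "great": "good"}
POSITIVE = {"good", "help", "thank"}
NEGATIVE = {"bad", "default", "complain"}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Turn one plain-text document into a small row of numeric features."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]   # eliminate noise words
    tokens = [crude_stem(t) for t in tokens]              # reduce to stems
    tokens = [SYNONYMS.get(t, t) for t in tokens]         # consolidate synonyms
    counts = Counter(tokens)
    total = len(tokens)
    positive = sum(counts[w] for w in POSITIVE)
    negative = sum(counts[w] for w in NEGATIVE)
    return {
        "n_terms": total,
        "positive_count": positive,
        "negative_count": negative,
        "negative_share": negative / total if total else 0.0,
    }


sample = "The customer defaulted on the loan and the lending terms were terrible."
print(extract_features(sample))
```

The returned dictionary is one row of a conventional table; running the function over many documents yields the kind of structured data set that the final step hands to standard analysis tools.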
Note that most text mining solutions do not try to get the computer to “understand” the complete meaning of sentences and documents; the computer does not syntactically “read” the text. Often, comparatively simple summaries or data representations are used to good effect in this field.
Organizations already collect a substantial amount of unstructured (or semi-structured) data from customers, partners and suppliers, and yet more is available through the media and the Internet – especially social media. Any of these unstructured data sources might be analyzed for correlation to business metrics of interest. Organizations in many fields are profitably exploiting unstructured data today, often with surprisingly simple tools.
Late in 2015, Harrisburg University of Science and Technology hosted Data Analytics Summit II, an analytics conference with a theme of unstructured data. Speakers came from a mixture of backgrounds and presented information on a variety of types of unstructured data. To the best of my knowledge, neither paper nor electronic copies of the presentation materials are being distributed, but video of the presentations may be of interest and can be found at the following Web links:
- Presentation by IBM Data & Analytics
- Presentation by QwikIntelligence, Inc.
- Presentation by WildFig Data