Big Data is a challenge not only because it is big in volume, variety and velocity, and its analysis requires expensive IT solutions, but also because the data arrives in heterogeneous formats that cannot be analyzed in their ‘raw’ form. So before a data scientist can even write a single algorithm, all the raw text files, binary files and databases need to be cleared of clutter and then correctly structured.
If the data source always remains the same, getting and cleaning the data can be handled by a script that cleans the data periodically, and the process is thus automated. However, as so often happens on the market (and especially in the IoT), data formats and data sources change. It then takes a lot of time and effort to collect, clean and structure the data from the new sources (or after shifting parameters in the data collection tools) and to automate the process again.
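To make this concrete, here is a minimal sketch of the kind of periodic cleaning script described above. The field names (`device_id`, `timestamp`, `temperature`) and the sample data are invented for illustration; a real pipeline would validate against whatever schema its source actually uses.

```python
import csv
import io

# Hypothetical schema: the fields every record must carry to be usable.
EXPECTED_FIELDS = ("device_id", "timestamp", "temperature")

def clean_rows(raw_csv):
    """Parse raw CSV text and keep only complete, well-typed rows."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for row in reader:
        # Drop any row that is missing a field or has an empty value.
        if any(not row.get(field) for field in EXPECTED_FIELDS):
            continue
        cleaned.append({
            "device_id": row["device_id"].strip(),
            "timestamp": row["timestamp"].strip(),
            "temperature": float(row["temperature"]),
        })
    return cleaned

# Invented sample input: the second row lacks a timestamp and is dropped.
raw = """device_id,timestamp,temperature
sensor-1,2016-01-01T00:00,21.5
sensor-2,,19.0
sensor-3,2016-01-01T00:05,18.2
"""

for record in clean_rows(raw):
    print(record)
```

As soon as the source changes its format (a renamed column, a new unit), a script like this must be rewritten, which is exactly where the recurring time and effort goes.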
One problem is that software is expensive and a lot of human intervention is needed to parameterize the data and create custom metrics, which makes data cleaning even more expensive. The more data formats there are, the more time data scientists need to clean and tidy up the data.
Over the years of working with data, I’ve noticed that different people understand ‘raw’ and ‘processed’ data in different ways. Many think (incorrectly) that processed data is data that has already been analyzed, and often wonder why it takes so long to clean the raw data. Raw data is the data coming from the original source, and it is difficult (often impossible) to use for analysis as-is.
In the Big Data world, the data comes mainly from the Internet or from GPS devices. Processed data is data that has already been cleaned, merged, sorted, subset and transformed, and is ready to be analyzed. Depending on the industry you work in, there may be defined standards for data processing.
Processed, clean (also called tidy) data is structured so that each variable is in one column and each observation of that variable is in a different row. There is one table for each kind of variable, and multiple tables are linked by a common column. Data in this form can be analyzed quickly and seamlessly.
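A small illustration of what this restructuring looks like in practice, using invented sample data: a ‘wide’ raw table with one column per year is reshaped so that each row holds exactly one observation.

```python
# Raw 'wide' form (invented data): one column per year, so a single row
# mixes several observations of the temperature variable.
wide = [
    {"city": "Paris", "temp_2015": 12.1, "temp_2016": 12.4},
    {"city": "Oslo",  "temp_2015": 6.3,  "temp_2016": 6.8},
]

def tidy(rows):
    """Reshape year columns into one (city, year, temp) row per observation."""
    out = []
    for row in rows:
        for key, value in row.items():
            if key.startswith("temp_"):
                out.append({
                    "city": row["city"],
                    "year": int(key.split("_")[1]),
                    "temp": value,
                })
    return out

for observation in tidy(wide):
    print(observation)
```

After the reshape, every column holds one variable and every row one observation, which is what lets later grouping, filtering and merging run without special-case logic.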
As cleaning and preparing data for analysis are activities not usually seen by the decision makers who need the information Big Data can provide, resources are normally allocated to the analysis of the data, but not to its cleaning. Thus, this step in a Big Data project is typically neglected. As a result, data scientists or PMOs quite often have to deal with ‘bugs’ after the fact, and the delivery of results gets delayed.
If I had to choose an analogy to better explain the importance of the data cleaning step in Big Data projects, kitchen prep in the fine dining industry would be best. For a chef to deliver a delicious, well-prepared and well-presented meal to your table within a reasonable span of time, he or she needs to have all the ingredients prepared for cooking (and even precooked) by the moment your order reaches the kitchen.
There is no way a chef can prepare a Grilled Pork Tenderloin with Cranberry Sauce for you within the 10 minutes you are prepared to wait for your meal. Simply marinating the meat takes 2-3 hours, and the cranberries must be cleaned, cooked and reduced into a sauce, which takes at least 20 minutes.
All of this preparation has to be done properly beforehand, and it is carried out outside opening hours by the kitchen staff under the supervision of the sous chef. This step of the food delivery process is called mise en place and consists of cleaning, cutting, pre-cooking, storing and labeling raw materials such as vegetables, meat and poultry. The ingredients are thus processed and ready for service. Without this preparation, it does not matter how good the chef is: the Grilled Pork Tenderloin would be late reaching your table, and it would be neither tasty nor well presented.
Restaurants have learned that mise en place is crucial for a well-functioning kitchen. Similarly, decision makers in the business world need to give data cleaning the level of importance it deserves, and assign it the resources it requires, in order to get accurate information in a timely manner.