Bigger is better, right? Maybe that’s why a lot of companies claim to have Big Data when they are in fact presiding over data that is medium-sized or even small. People want to have Big Data in the same way that they want a big car or a big house. They shouldn’t, though, and here’s why:
Big Data is a dataset that is so large and/or so rapidly changing that it cannot be held robustly with ‘normal’ data approaches. To claim a dataset is ‘big’ is equivalent to saying ‘we can’t cope with this dataset’. Perhaps if everyone looked at Big Data this way, fewer people would be keen to lay claim to it.
Another myth is that Big Data approaches are newer and better. In fact, they are very old. They preceded the current ‘normal’ approach and were rejected in its favour for good reasons. You would not want small or medium data organised as though it were ‘big’ because it would be slower and less robust. Big Data approaches are effectively a last resort for when normal approaches can’t cope.
It’s important not to be fobbed off by systems experts telling you that your problems are down to Big Data. It might be your own people excusing poor system robustness or it might be external people trying to sell you an expensive alternative.
You can have millions of rows of data and not have Big Data. Millions of rows can be accommodated perfectly well with normal approaches. And while you might feel you have an awful lot of data, you probably don’t even have millions of rows.
Excel spreadsheets used to hold just over 65,000 rows in each tab (65,536, to be exact); current versions hold just over a million. For most people, an Excel workbook with multiple full tabs feels like a lot of data. At the risk of stating the obvious, ‘small’ data is a lot more than you could fit into Excel.
It’s not ‘small’ by human standards, but it is small by database standards. You may personally be drowning in data, but that does not mean your database can’t cope.
Normal data is semi-static in that a large proportion of the data remains the same from day to day. Rows may be inserted and fields may be edited, but most of the dataset that existed yesterday still exists in the same form today.
Big Data is usually easy to spot because it is so different from regular data. Perhaps it is real-time trading data. Or it might be based on social media – not only large and rapidly updating, but with each update bigger than the last.
We’re not talking an overnight feed of 100,000 rows here. We’re talking real-time updates arriving hundreds of times per second. You don’t need to be a technical wizard to know whether your dataset has these features.
The second way of spotting Big Data is by noticing that your systems aren’t coping. Your databases might fall over and your user interface might freeze. This is tricky to determine since many problems can lead to poor systems performance.
A very poor way of spotting Big Data, though, is to take it on faith from your systems people or a software vendor. They might tell you that you have a lot of data and you might feel flattered and understood.
‘Yes’, you think. ‘Someone gets the problems we’re having. We are dealing in too much information.’ Beware of this response. Changing your databases over to a ‘Big Data’ structure can cost a lot of money and leave you worse off than before.
If you’ve ever used Excel, you will be able to understand the difference between regular and Big Data approaches.
The standard approach used today is called “relational data”. The relational model was proposed by E. F. Codd in 1970 and is still very much fit for purpose.
Imagine an Excel worksheet. You have letters across the top (one per column), and numbers down the side (one per row). Instead of letters, imagine field names per column. Critically, imagine a unique ID field in column A. This will contain a unique reference, probably a number just like the row numbers.
When you populate the space between row and column headings, you are structuring your data in a cross tabular format where you can read across the rows and down the columns. The rows and columns relate to each other. This Excel sheet is equivalent to a table in a relational database.
Multiple tables (worksheets) make up your workbook (database). There may be links between fields in different tables, just as there may be formulae connecting different tabs. Your Excel workbook is a perfect analogy for a relational database.
Relational data is a powerful approach in a database for the same reason that crosstab data is useful in Excel. In a well-designed database, you can read off the data you want using column name and row id.
Let’s say you have a table containing staff details. Staff ID, name and all sorts of descriptors are fields in a single table. You can look up data any way around: find staff by ID and look up an attribute or look up an attribute to find multiple matching staff.
You don’t need to look up an ID in one table and then reference several other tables to find your answer. This is much slower and less robust. And it is how Big Data approaches work.
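To make this concrete, here is a minimal sketch using Python’s standard-library sqlite3 module. The ‘staff’ table, its fields and its sample rows are all invented for illustration; the point is simply that one table answers lookups in either direction:

```python
import sqlite3

# In-memory database for illustration; table and sample data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE staff (
        staff_id   INTEGER PRIMARY KEY,
        name       TEXT,
        department TEXT
    )
""")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Alice", "Finance"), (2, "Bob", "Finance"), (3, "Carol", "Sales")],
)

# Look up an attribute by ID...
dept = conn.execute(
    "SELECT department FROM staff WHERE staff_id = ?", (1,)
).fetchone()[0]

# ...or look up an attribute to find multiple matching staff.
finance = [row[0] for row in conn.execute(
    "SELECT name FROM staff WHERE department = ?", ("Finance",)
)]
print(dept)     # Finance
print(finance)  # ['Alice', 'Bob']
```

Both queries run against the same single table: no hopping between tables, no intermediate lookups.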
When you add a new row to a relational table, such as the Staff table, you need to add all the other fields as well. Let’s say this table has 20 fields (columns). You can’t just add a new member of staff by adding a new ID and name. You need to supply all 20 fields, even if some of them are NULL or blank. So, every new row means 20 new data points.
What’s more, the database expects a certain datatype for each field. One field is text, the next is a number, the next is a date. So those remaining 18 data points can’t simply be filled in any old how: for each field you need either a NULL (where the field allows one) or the right kind of ‘blank’ for its datatype, such as an empty string for text or zero for a number.
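A small sketch of this, again using Python’s standard-library sqlite3 module with an invented table. A row that supplies every field (even as NULL) goes in fine; a row that omits a required field is rejected by the database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE staff (
        staff_id   INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        start_date TEXT NOT NULL,   -- SQLite commonly stores dates as ISO text
        salary     REAL
    )
""")

# A full row: every field supplied, even if one is left as NULL.
conn.execute(
    "INSERT INTO staff VALUES (?, ?, ?, ?)",
    (1, "Alice", "2021-04-01", None),  # salary deliberately NULL
)

# A row missing required fields is rejected outright.
rejected = False
try:
    conn.execute("INSERT INTO staff (staff_id) VALUES (2)")
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

So every new row really does cost you a value (or an acceptable blank) for every column in the table.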
In a rapidly expanding database, 20 new data points per row is a lot. Too much, perhaps. What you need to do is identify the data point that is rapidly changing (social media ID, for example) and isolate that in a minimalist table. You eject all the spare descriptive fields and only include the ID and message content, for example.
Imagine a new Excel workbook. Now the only two columns that are ever populated are A and B. Column A holds a unique reference (a ‘key’), but it needn’t be a number ID. One record might be text and the next might be a date.
Likewise, column B (the ‘value’) can hold a mixture of datatypes. These non-relational (or key-value pair) databases aren’t fussy about datatype, and empty data points take up zero space rather than needing placeholders. This makes these databases very flexible and highly efficient in terms of memory. But retrieving data becomes a whole lot more complicated.
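A Python dictionary behaves much like this two-column workbook, so it makes a handy sketch of the key-value idea. The keys and values below are invented for illustration; note that they mix datatypes freely, and that an absent field is simply an absent key rather than a placeholder:

```python
import datetime

# A minimal key-value "table": keys and values can each be any type.
store = {}
store["user:42:name"] = "alice"                            # text key, text value
store[datetime.date(2024, 1, 5)] = 17_250                  # date key, number value
store[("alice", "2024-01-05T09:30Z")] = "first tweet!"     # composite key, text value

# Retrieval is by exact key only; there are no columns to read down.
print(store["user:42:name"])        # alice

# A field that was never stored takes up no space at all.
print(store.get("user:42:device"))  # None
```

The flexibility is obvious, but so is the cost: without rows and columns, there is no ‘read across and down’ to fall back on.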
Let’s say you’re looking at Twitter data, and you need to know what kind of device a message was posted on. You may only want to retrieve smartphone data, for example. The main table holds only a key comprising the username and date-time stamp, with the tweet content as its value.
You wouldn’t want “device” in this table because it doesn’t change that often and so would waste space. “Device” might be held in a different table or it might not be stored at all.
You can’t just read off the data as you can with relational data. You might need to look up data in multiple tables. This makes data retrieval slower. But on the upside, it makes data retrieval possible in cases where relational databases wouldn’t be able to cope.
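The multi-table lookup can be sketched with two plain Python dicts standing in for two key-value tables. The usernames, timestamps and device labels are invented; the point is that answering one question forces you to consult both stores:

```python
# Two separate key-value "tables" (plain dicts in this sketch).
# Keys in `tweets` are (username, timestamp); values are the message text.
tweets = {
    ("alice", "2024-01-05T09:30Z"): "hello",
    ("bob",   "2024-01-05T09:31Z"): "morning",
    ("alice", "2024-01-05T09:40Z"): "again",
}
devices = {"alice": "smartphone", "bob": "laptop"}

# To retrieve only smartphone tweets, we must cross-reference BOTH stores:
# look up each user's device, then filter the tweet keys accordingly.
smartphone_tweets = [
    text for (user, _ts), text in tweets.items()
    if devices.get(user) == "smartphone"
]
print(smartphone_tweets)  # ['hello', 'again']
```

Every extra hop like this makes retrieval slower, but the stores themselves stay lean enough to absorb data at speed.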
Unless you deal in fast data, you probably don’t deal in Big Data. And you should be pleased about this. When it comes to data, bigger isn’t better. Big Data approaches allow us to cope with a quantity and speed of data that would defeat normal approaches, but they have significant drawbacks. For most purposes, relational databases remain the best tool for the job.