Building a Data Storage Scalable Platform

What is a data platform? How is it different from your "good ol' days" MySQL? Why is it suddenly important to approach your organization's everyday data needs with the notion of a platform? How do you invest wisely in data analytics without having to completely rewrite all of your data-driven apps every 3-5 years?

In the beginning there were files, a lot of files. People tried to use them for all of their data needs. They would store spreadsheets (and they still do), data tables, structured and unstructured data, binary data... and anything else their wonderful computer programs would produce. And there were business transactions, a lot of them. It was nearly impossible to use files for transactions without running into complicated technical issues, so people invented the Relational Database Management System (RDBMS). But of course, most of these systems still used files as their data storage.

Fast forward a few decades.

There are still a lot of files and there are still RDBMSs, and these days we are still trying to use them for every data need. We use a database even if the data flow is not transactional and our data looks more like an unstructured blob. The same low barrier to entry that made files the ubiquitous data storage paradigm in our everyday operations has created another Swiss Army knife in the form of the RDBMS.

Yes, Big Data requirements are changing the landscape at the speed of light, but there is no guarantee that a new jack of all trades will solve our big data problems. Nor is there any guarantee that Hadoop and HBase, for example (replace these names with any new buzzword promising speed-of-light data analytics: Spark, Impala, Drill...), are not going to end up the same way as the infamous MySQL database. We are definitely stuck with files for a few more generations.

I think any organization should approach its data analytics needs with the same strategy: apply the best tool for the task or, more precisely, for each stage of the data analytics flow. Our software tools will change; this is inevitable. There are many different use cases for data analysis, and I will try to suggest the most versatile approach.

Let’s review the building blocks of the proposed platform.

I consider the level of isolation and decoupling it can achieve to be the most important trait of any platform. This is why we should be talking about Service-Oriented Architecture (SOA). This alone allows an organization to replace any component at its end of life, or to maintain it properly, without worrying about possible downtime. SOA is definitely not something new, but you would be surprised how many organizations undervalue it because of the extra layers and overhead it adds. The good news is that most modern data tools and storage engines accommodate SOA and normally offer at least some support in the form of a REST API. Support for SOAP is slowly diminishing: the once popular XML-based protocol has been called overweight too many times and has fallen out of favor.
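To make the decoupling argument concrete, here is a minimal sketch of application code that depends only on a service contract rather than on a particular storage engine. All names (`RecordStore`, `InMemoryStore`, `LoggingStore`, `register_order`) are hypothetical and exist only for illustration; in a real platform the backend class would call the storage engine's REST API instead of holding a dictionary.

```python
from typing import List, Optional, Protocol


class RecordStore(Protocol):
    """Hypothetical service contract: callers see only these two operations."""

    def put(self, key: str, value: dict) -> None: ...

    def get(self, key: str) -> Optional[dict]: ...


class InMemoryStore:
    """Stand-in backend; a real one would talk to a storage service."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def put(self, key: str, value: dict) -> None:
        self._rows[key] = dict(value)

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)


class LoggingStore:
    """A replacement backend that wraps another store and records every call."""

    def __init__(self, inner: RecordStore) -> None:
        self._inner = inner
        self.calls: List[str] = []

    def put(self, key: str, value: dict) -> None:
        self.calls.append("put " + key)
        self._inner.put(key, value)

    def get(self, key: str) -> Optional[dict]:
        self.calls.append("get " + key)
        return self._inner.get(key)


def register_order(store: RecordStore, order_id: str, total: float) -> Optional[dict]:
    """Application code: it knows the contract, not the engine behind it."""
    store.put(order_id, {"total": total, "status": "new"})
    return store.get(order_id)
```

Swapping `InMemoryStore()` for `LoggingStore(InMemoryStore())` requires no change to `register_order`, which is the property that lets a component be replaced at its end of life.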

At the center of the SOA data platform is the data storage, whose type and exact nature are dictated by the data itself. One can pick the well-known MySQL RDBMS as the transactional storage backing a web-commerce site, or MongoDB for document-oriented storage with soft transactional requirements. One can resort to the column-oriented HBase engine for scalable storage of time-stamped observations that rarely need update or delete operations. Or it could be a hybrid model with transactional data storage and a complementary NoSQL database. But this is only part of the deal.

This platform won’t be a complete solution without two other important components. One of them is a distributed, scalable data cache. It could be Couchbase or Redis, but it must provide a scalable in-memory caching component for our data-driven applications. And of course it must be offered as a service.
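One common way for an application to use such a cache is the cache-aside pattern: read from the cache first and fall back to the data storage only on a miss. The sketch below illustrates the idea with an in-process dictionary standing in for a distributed cache such as Redis or Couchbase; `TTLCache`, `cached_read`, and the `load_from_storage` callback are hypothetical names, and the expiry policy is an assumption, not something the platform prescribes.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process stand-in for a cache service; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def cached_read(cache: TTLCache, key: str,
                load_from_storage: Callable[[str], Any]) -> Any:
    """Cache-aside read: try the cache, hit the storage only on a miss."""
    value = cache.get(key)
    if value is None:  # a miss (or a stored None, which this sketch does not cache)
        value = load_from_storage(key)
        cache.set(key, value)
    return value
```

Because the application only calls `get` and `set`, the dictionary could later be replaced by a client for a real cache service without touching `cached_read`.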

The remaining block is the analytical search engine. There are a couple of contenders here: Solr and Elasticsearch. Both are very capable, but Elasticsearch has an edge thanks to its aggregations framework. The need for a search engine comes from the very wide range of on-demand data queries that form your OLAP offering. Building OLAP data cubes normally requires a full-blown ETL tool, highly skilled individuals and very expensive specialized data storage. For most use cases, integrating an Elasticsearch cluster would answer those needs.
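As an illustration of the kind of on-demand, OLAP-style query an aggregations framework can serve, the sketch below builds the JSON body for a grouped average: a `terms` aggregation with an `avg` sub-aggregation. The field names `region` and `order_total` and the aggregation labels are made up for this example, and the code only constructs the request body; it would be sent to the `_search` endpoint of whatever index holds the data.

```python
import json
from typing import Any, Dict


def average_by_term(group_field: str, value_field: str,
                    buckets: int = 10) -> Dict[str, Any]:
    """Build a search body that averages value_field per distinct group_field."""
    return {
        "size": 0,  # return aggregation results only, no individual documents
        "aggs": {
            "by_group": {  # hypothetical label for the grouping buckets
                "terms": {"field": group_field, "size": buckets},
                "aggs": {
                    # hypothetical label for the per-bucket metric
                    "average_value": {"avg": {"field": value_field}}
                },
            }
        },
    }


# Example: average order total per region (both field names are assumptions).
request_body = average_by_term("region", "order_total")
print(json.dumps(request_body, indent=2))
```

The same roll-up in a classic OLAP setup would typically mean defining a cube and an ETL job in advance; here it is expressed at query time.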

This is your platform: scalable caching, scalable data storage plus a transactional database for the limited data sets, and a scalable analytical search engine. I believe this kind of platform would be sufficient for the data needs of most organizations.

There are a few optional components to deal with corner cases. One may consider a message queue (ActiveMQ, RabbitMQ or Kafka) in front of the data storage to absorb any burstiness in the data flow or to aggregate multiple data sources.
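The buffering role of a queue can be sketched with nothing but the standard library: a bounded in-process queue stands in for the message broker, a producer pushes a burst of events, and a single consumer writes them to storage in batches. The names `drain_to_storage` and `ingest_burst`, the batch size, and the capacity are all assumptions chosen for illustration, and the `storage` list is a placeholder for the real data store.

```python
import queue
import threading
from typing import Iterable, List


def drain_to_storage(buffer: "queue.Queue", storage: List[dict],
                     batch_size: int) -> None:
    """Consumer: pull events off the queue and write them in batches.

    Stops when it receives the None sentinel, flushing any partial batch.
    """
    batch: List[dict] = []
    while True:
        event = buffer.get()
        if event is None:
            break
        batch.append(event)
        if len(batch) >= batch_size:
            storage.extend(batch)  # one bulk write instead of many small ones
            batch = []
    storage.extend(batch)


def ingest_burst(events: Iterable[dict], batch_size: int = 10,
                 capacity: int = 100) -> List[dict]:
    """Producer side: push a burst through a bounded buffer into storage."""
    buffer: "queue.Queue" = queue.Queue(maxsize=capacity)
    storage: List[dict] = []
    worker = threading.Thread(target=drain_to_storage,
                              args=(buffer, storage, batch_size))
    worker.start()
    for event in events:
        buffer.put(event)  # blocks when the buffer is full (backpressure)
    buffer.put(None)  # sentinel: no more events
    worker.join()
    return storage
```

With a single consumer the events reach storage in arrival order; a real broker adds durability and multiple consumers, which this sketch does not attempt to model.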

Also, data visualization tools like Tableau could be considered for a more sophisticated UI data presentation layer.

Article written by Maxim Grigoriev