One of the most ignored and, in my view, one of the most important components of a modern enterprise data and information service is a robust, evolving and intelligent data catalog system.
This matters even more if the service is designed to endorse, support and sustain self-service for both data discovery and data usage.
Creating an enterprise data service with all the bells and whistles but no data catalog component is like opening a store with plenty of inventory in great-looking packaging but no way for customers to find the items they are looking for. As a matter of fact, in some cases, they may not even know ahead of time which items they are looking for.
You can still run the store, and perhaps you can still make some money; however, as a store owner or store keeper you need to know what exactly you have in the store and where it is located and what it is used for.
When you are not in the store, you also have to make sure that your replacement has exactly the same level of information about your merchandise as you do, something that cannot be achieved effectively without investing a significant amount of time and money.
You can deliver goods to a customer by asking what exactly they are looking for, how they intend to use the item and, in some cases, whether they can describe its look and feel.
Once you have this information, you can try to find the items, either yourself or with the help of other team members, while the customer waits. Once the items are found, they can be handed to the customer, provided the customer is still waiting in the store.
Now, at this point, if the customer realizes that this is not what he or she is looking for, you need to start the same process again. And by the way, how about offering customers similar products or companion products? It is almost impossible to achieve this goal within a “tell me what you are looking for, and I will see what I can do” model.
This is true for modern data services as well. The days when we asked users what they were looking for and then found them an exact or close match (or, failing that, created something new for them) are numbered, if not already over.
Modern data services MUST have certain obvious features engineered within the system. A data catalog system that is constantly learning from its users to improve the quality of service is, in my view, critical for the success of any data service.
Perhaps I'm speaking more in the context of modern data platforms and services designed using concepts like data lakes on distributed computing frameworks such as Apache Hadoop, but my belief is that traditional systems built around relational data store technologies such as Oracle, Teradata or a combination of the two can also benefit from a robust catalog service.
Six reasons why I think a data catalog system is important:
Below are thoughts on designing such a system. You can also look at these as a set of requirements for a data catalog system.
More and more organizations are making a conscious effort to create enterprise data services on a data lake architectural pattern on top of Hadoop.
This new way of thinking and these new architectural patterns offer a massive benefit when designing a robust data catalog. Most of the datasets are already available within Hadoop, which should make cataloging them easier.
There are various data catalog products available that can be used as turnkey solutions.
CKAN is an open source data catalog system that can also be customized for specific needs. This can be further engineered to integrate with systems that are not integrated out of the box.
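As a rough illustration of what integrating with CKAN can look like, the sketch below builds a request URL for the `package_search` action of CKAN's Action API and pulls dataset titles out of a response. The host `catalog.example.org` and the helper names `build_search_url` and `dataset_titles` are made up for this example; only the endpoint path and the `q` and `rows` parameters come from CKAN itself. No network call is made here, so the response is a hand-written sample in the shape CKAN returns.

```python
import json
from typing import List
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a URL for CKAN's package_search action (Action API v3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> List[str]:
    """Return dataset titles from a package_search JSON response.

    Falls back to the dataset's machine name when no title is set.
    """
    payload = json.loads(response_text)
    if not payload.get("success"):
        return []
    return [d.get("title") or d.get("name", "")
            for d in payload["result"]["results"]]


# Hypothetical CKAN host, used only for illustration.
url = build_search_url("https://catalog.example.org/", "customer churn", rows=5)

# Hand-written sample response, not real catalog data.
sample = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "churn-2016", "title": "Customer Churn 2016"},
            {"name": "churn-raw", "title": ""},
        ],
    },
})
titles = dataset_titles(sample)
```

In a real integration the URL would be fetched with an HTTP client and the response body passed to `dataset_titles`; the same pattern can feed search results into whatever front end your data service already exposes.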
Microsoft Azure offers a data catalog service that can be used after some basic configuration, as does Amazon Web Services.
Alation’s enterprise data catalog system is also worth mentioning. This product is fully capable of operating on Hadoop and non-Hadoop-based data systems at the same time.
If none of these options are feasible for you, it is not a huge deal to engineer a system using some of Hadoop’s ecosystem components.
I would argue that within the context of an enterprise information management landscape, even though the notion of a data catalog is powerful and is an essential component of a strong data service, at some point we must think about moving this notion to the next level by trying to catalog the entire set of information assets.
It will be an enormous help if a user can use the same search user interface to find not only data sets but also information presentation systems that meet his or her criteria. If a data catalog system can provide a user with a list of the reports, dashboards, visualizations, widgets and, perhaps, mobile apps that match the requested criteria, it will not only bring down development cost but will also encourage the wise reuse of existing data and information assets.
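One way to picture a catalog that spans more than data sets is a single index in which every asset carries a type alongside its tags. The following is a minimal sketch under that assumption; the `Asset` class, the `search` function and the sample entries are all hypothetical and are not drawn from any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Asset:
    """One catalog entry; `kind` distinguishes data sets from reports, etc."""
    name: str
    kind: str  # e.g. "dataset", "report", "dashboard", "mobile app"
    tags: Set[str] = field(default_factory=set)


def search(assets: Iterable[Asset], terms: Iterable[str]) -> Dict[str, List[str]]:
    """Return names of assets whose tags contain every term, grouped by kind."""
    wanted = {t.lower() for t in terms}
    grouped: Dict[str, List[str]] = {}
    for asset in assets:
        if wanted <= {t.lower() for t in asset.tags}:
            grouped.setdefault(asset.kind, []).append(asset.name)
    return grouped


# Illustrative entries only.
catalog = [
    Asset("sales_orders", "dataset", {"sales", "orders"}),
    Asset("Regional Sales Overview", "dashboard", {"sales", "region"}),
    Asset("Monthly Churn Report", "report", {"churn", "customers"}),
]

matches = search(catalog, ["Sales"])
```

Because the result is grouped by asset type, the same query surfaces an existing dashboard next to the underlying data set, which is the kind of reuse described above.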
In general, cataloging systems not only increase the visibility and accessibility of data assets but also strengthen trust in the data and improve its authenticity, utilization and governance.