Why Data Catalog Systems Matter in Modern Data Services

One of the most ignored and, in my view, most important components of a modern enterprise data and information service is a robust, evolving and intelligent data catalog system.

This is even more important if the service is designed to endorse, support and sustain self-serve data discovery and data usage.

Creating an enterprise data service with all the bells and whistles but no data catalog component is like opening a store with plenty of inventory in great-looking packaging but no way for customers to find the items they are looking for. As a matter of fact, in some cases they may not even know ahead of time which items they are looking for.

You can still run the store, and perhaps you can still make some money; however, as a store owner or store keeper you need to know what exactly you have in the store and where it is located and what it is used for.

When you are not in the store, you also have to make sure that your replacement has exactly the same level of information about your merchandise as you do, something that cannot be achieved effectively without investing a significant amount of time and money.

You can deliver goods to customers by asking them exactly what they are looking for, how they intend to use the item and, in some cases, whether they can describe its look and feel.

Once you have this information, you can try to find the items, either yourself or with the help of other team members, while the customer waits. Once the items are found, they can be provided to the customer, that is, if the customer is still waiting in the store.

Doesn't sound much like self-serve, does it?

Now, at this point, if a customer realizes that this is not what he or she is looking for, you need to start the same process again. And by the way, how about offering customers similar products or companion products? It is almost impossible to achieve this within a “tell me what you are looking for, and I will see what I can do” model.

The same is true for modern data services. The days when we asked users what they were looking for and then found them an exact or close match (or, failing that, created something new for them) are numbered, if not already over.

Modern data services MUST have certain obvious features engineered within the system. A data catalog system that is constantly learning from its users to improve the quality of service is, in my view, critical for the success of any data service.

Perhaps I'm speaking more in the context of modern data platforms and services designed using concepts like data lakes on distributed computing frameworks such as Apache Hadoop, but my belief is that traditional systems built around relational data store technologies, such as Oracle, Teradata or a combination of these technologies can also benefit from a robust catalog service.

Driving factors behind an enterprise data catalog system

Six reasons why I think a data catalog system is important:

  1. Allows users to find what datasets are available within the organization
  2. Allows users to find the most appropriate use of datasets
  3. Allows users to understand how other individuals and teams are using the dataset
  4. Allows DevOps teams to manage an inventory of data assets
  5. Increases visibility around dataset usage, enhancing the possibility of transparent audit and greater implementation of governance policies
  6. Helps demonstrate the credibility and authenticity of datasets, which in turn should increase user confidence in them

A robust data catalog system – design principles

Below are thoughts on designing such a system. You can also look at these as a set of requirements for a data catalog system.

  1. System should provide data stewards and owners a way to onboard a dataset. This can be done partially via an automated data pipeline that requires owners to review, change and approve the dataset's profile and add metadata tags.
  2. Users should be able to use an interactive UI to search data assets. Users should be able to identify certain datasets as “Favorites” or “Follow My Lifecycle”.
  3. System MUST honor the data privacy and security rules that protect specific datasets. For example, depending upon security rules, a user should be able to search for a dataset, look at its data profile and then put in a request to gain access. If a dataset is so sensitive that certain users should not even be able to see it, it should not be visible in their search results.
  4. System should allow users to see a dataset’s history, best possible use, data quality rating, information about its owner or steward and ways to request data access.
  5. System should also allow users to subscribe to dataset profile updates and changes.
  6. System should allow users (which could be systems, as well) to rate datasets on various criteria (usefulness, speed of access, data quality). This rating should be available to new users who seek to use the dataset.
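
As a rough illustration of requirements 3 and 6 above, here is a minimal Python sketch of a visibility-aware search and a simple rating mechanism. Everything in it is an assumption made for illustration: the `DatasetProfile` class, the three sensitivity levels ("open", "restricted", "hidden"), the group-based access check and the 1 to 5 rating scale are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DatasetProfile:
    name: str
    description: str
    steward: str
    # Hypothetical sensitivity levels:
    #   "open"       - anyone may see and use the dataset
    #   "restricted" - anyone may see the profile, access must be requested
    #   "hidden"     - only members of allowed_groups may even see it
    sensitivity: str = "open"
    allowed_groups: set = field(default_factory=set)
    ratings: dict = field(default_factory=dict)  # criterion -> list of scores

    def rate(self, criterion, score):
        """Record a 1-5 score for a criterion such as 'data quality'."""
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self.ratings.setdefault(criterion, []).append(score)

    def average_ratings(self):
        """Average score per criterion, shown to prospective users."""
        return {c: round(mean(s), 2) for c, s in self.ratings.items()}


def search(catalog, query, user_groups):
    """Return (profile, access_status) pairs the user is allowed to see."""
    results = []
    needle = query.lower()
    for profile in catalog:
        haystack = f"{profile.name} {profile.description}".lower()
        if needle not in haystack:
            continue
        has_access = bool(profile.allowed_groups & user_groups)
        if profile.sensitivity == "hidden" and not has_access:
            continue  # too sensitive to reveal that it exists
        if profile.sensitivity == "open" or has_access:
            results.append((profile, "granted"))
        else:
            results.append((profile, "request access"))
    return results


catalog = [
    DatasetProfile("sales_orders", "Daily sales orders", steward="alice"),
    DatasetProfile("sales_payroll", "Payroll for sales staff", steward="bob",
                   sensitivity="hidden", allowed_groups={"hr"}),
    DatasetProfile("sales_leads", "Qualified sales leads", steward="carol",
                   sensitivity="restricted", allowed_groups={"marketing"}),
]
catalog[0].rate("data quality", 4)
catalog[0].rate("data quality", 5)

for profile, status in search(catalog, "sales", user_groups={"analytics"}):
    print(profile.name, status, profile.average_ratings())
```

In this sketch a user in the "analytics" group sees the open dataset and the restricted one (with a prompt to request access), while the hidden payroll dataset never appears in the results at all. A real system would of course delegate these decisions to the organization's security policies rather than a hard-coded field.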

Implementation options

More and more organizations are making a conscious effort to create enterprise data services on a data lake architectural pattern on top of Hadoop.

This new way of thinking and these new architectural patterns offer a massive benefit when it comes to designing a robust data catalog. Most datasets are already available within Hadoop, which should make cataloging them easier.

There are various data catalog products available that can be used as turnkey solutions.

CKAN is an open source data catalog system that can also be customized for specific needs. This can be further engineered to integrate with systems that are not integrated out of the box.

Microsoft Azure offers a data catalog service that can be used after some basic configuration, and so does Amazon Web Services.

Alation’s enterprise data catalog system is also worth mentioning. This product is fully capable of operating on Hadoop and non-Hadoop-based data systems at the same time.

If none of these options are feasible for you, it is not a huge deal to engineer a system using some of Hadoop’s ecosystem components.

Imagine this:

  • Your enterprise data service runs on a Hadoop-based processing engine. A data ingestion layer ingests all datasets.
  • When datasets are ingested, they go through basic data curation and cleansing. Stewards then provide (via a UI or a configuration file) some basic information about each dataset in terms of:
    • Source and authority
    • Data quality metrics
    • Identity of dataset
    • Best possible usage
    • Connected endpoint
    • Steward and owner
    • Governance and security
    • Classification
    • And much more
  • This information is stored in Apache HBase, along with a summary of the data, as a data profile.
  • Stored profiles are then indexed and re-indexed with the help of Apache Solr.
  • A web (and mobile) front end, built on the Solr APIs, lets users search for the various datasets available within your system.
  • Your most rudimentary and basic data catalog system is ready to be used. This small app can be a great way to experiment with catalog processes.
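
The flow above (store a profile, index it, search it) can be sketched in a few lines of Python. To keep the example self-contained and runnable, it deliberately does not call HBase or Solr: a plain dictionary stands in for the HBase table and a small in-memory inverted index stands in for the Solr index. The function names (`onboard`, `search`), the profile fields and the sample datasets are all hypothetical.

```python
import re
from collections import defaultdict

profile_store = {}                  # stand-in for an HBase table: row key -> profile
inverted_index = defaultdict(set)   # stand-in for a Solr index: token -> row keys


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def onboard(dataset_id, profile):
    """Store a steward-supplied profile and (re)index it."""
    # Drop stale entries first so onboarding again behaves like a re-index.
    for keys in inverted_index.values():
        keys.discard(dataset_id)
    profile_store[dataset_id] = profile
    for value in profile.values():
        for token in tokenize(str(value)):
            inverted_index[token].add(dataset_id)


def search(query):
    """Return ids of datasets whose profile contains every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(inverted_index.get(t, set()) for t in terms))
    return sorted(matches)


onboard("clinical_trials", {
    "source": "trial management system",
    "steward": "data governance team",
    "classification": "confidential",
    "best_usage": "enrollment trend analysis",
})
onboard("web_clickstream", {
    "source": "web analytics",
    "steward": "marketing analytics",
    "classification": "internal",
    "best_usage": "campaign trend analysis",
})

print(search("trend analysis"))   # matches both profiles
print(search("enrollment"))       # matches only clinical_trials
```

Swapping the two stand-ins for real HBase and Solr clients would change the storage and indexing calls but not the overall shape of the flow, which is the point of experimenting with a small app like this first.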

Data catalog to information asset catalog

I would argue that within the context of an enterprise information management landscape, even though the notion of data catalog is powerful and is an essential component of a strong data service, at some point we must think about moving this notion to the next level by trying to catalog entire information asset sets.

It will be an enormous help if a user can use the same search interface to find not only datasets but also information presentation systems that meet his or her criteria. If a data catalog system can provide a user with a list of the reports, dashboards, visualizations, widgets and, perhaps, mobile apps that match the requested criteria, it will not only bring down development cost but also encourage the wise reuse of existing data and information assets.
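
To make the idea concrete, here is a small Python sketch in which a single search returns every kind of asset that matches, grouped by type. The `Asset` class, the tag-based matching and the sample assets are assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    asset_type: str    # e.g. "dataset", "report", "dashboard", "mobile app"
    tags: frozenset


def find_assets(assets, wanted_tags):
    """Group the assets that carry every requested tag by asset type."""
    wanted = frozenset(wanted_tags)
    grouped = defaultdict(list)
    for asset in assets:
        if wanted <= asset.tags:   # asset has all requested tags
            grouped[asset.asset_type].append(asset.name)
    return dict(grouped)


assets = [
    Asset("sales_orders", "dataset", frozenset({"sales", "daily"})),
    Asset("Regional Sales Report", "report", frozenset({"sales", "regional"})),
    Asset("Sales Pipeline Dashboard", "dashboard", frozenset({"sales", "daily"})),
    Asset("hr_headcount", "dataset", frozenset({"hr"})),
]

print(find_assets(assets, {"sales"}))
print(find_assets(assets, {"sales", "daily"}))
```

A user searching for "sales" would see the dataset, the report and the dashboard side by side, and could decide to reuse an existing dashboard instead of requesting a new one.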

In general, cataloging systems not only increase the visibility and accessibility of data assets, they enhance the trust in data, authenticity, utilization and effective governance of data.

Article written by Manoj Vig
Image credit by Getty Images, DigitalVision Vectors, polygraphus