A Data Lake Journey From Truth to Trust
The term ‘Data Lake’ has become synonymous with almost every data initiative in the enterprise. This makes sense given the inherent benefits it brings, such as lowering the total cost of storage and eliminating the need for archival. Data Lakes also provide a flexible schema and serve as a workbench for discovering new patterns and generating value from raw data. The ability to manage data fidelity takes a major burden off the Audit & Compliance departments of telecom, banking, and insurance enterprises.

That being said, there are always two sides to every coin. With Data Lakes crossing petabytes of storage, some fundamental questions arise from a business user’s perspective:

  1. What data do I have?
  2. Does this report/model use all possible data sources or variables?
  3. Can we trust this data source? Is this consistent?
  4. Is somebody using this data to make certain business decisions?
  5. Can I really send my sensitive data to a Data Lake? Who can access it?
  6. How do I integrate so many varied data sources having structured and unstructured data?
  7. Which technology should I use? Does my team or enterprise have the required skills? (Data Scientists with advanced data analysis skills are very difficult to find!)

Is this a Data Landfill or a Data Lake?

There is an inherent need for a framework around a Data Lake that manages metadata (technical as well as business), lineage (north <-> south), traceability, and audit. Business leaders understand the importance of Data Lakes but also recognize the gaps in the concept. They want an ecosystem that plugs these holes from the start.

It is expected that Data Lakes will transform into Data Marketplaces (like Amazon or Flipkart) that not only give access to their products (data, in this case), but also act as recommendation platforms for business users, guiding them on which dataset to use, who should use it, in what combination, and what the end product will be.

The DCQI (Data Cataloging and Quality Imprinting) Framework and Engine rests on four pillars:

  1. Data Source Catalogue (metadata linking ingested data to data sources, analytical models, advanced visualizations, and applications, enabling lineage and traceability at any level, north <-> south)
  2. Business Glossary (data classified into its respective line of business)
  3. Analytical Model Usage (data tagged in the respective analytical models defined)
  4. Access Audit & Version (when and which version was last used, and by whom)
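To make the four pillars concrete, here is a minimal sketch of what a single catalogue entry might look like. All names (`CatalogEntry`, `AccessEvent`, the field names) are illustrative assumptions, not part of any published DCQI API; the point is only to show how source metadata, business classification, model tagging, and access auditing can live on one record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AccessEvent:
    """One audit record: who touched which version, and when (Pillar 4)."""
    user: str
    version: str
    accessed_at: datetime

@dataclass
class CatalogEntry:
    # Pillar 1: source metadata enabling lineage and traceability
    dataset_id: str
    source_system: str
    downstream_models: list = field(default_factory=list)
    # Pillar 2: business glossary classification
    line_of_business: str = "unclassified"
    # Pillar 3: analytical models this data is tagged in
    model_tags: list = field(default_factory=list)
    # Pillar 4: access audit and versioning
    access_log: list = field(default_factory=list)

    def record_access(self, user: str, version: str) -> None:
        """Append an audit event so 'who used what, when' is always answerable."""
        self.access_log.append(AccessEvent(user, version, datetime.now(timezone.utc)))

    def last_access(self):
        """Return the most recent access event, or None if never accessed."""
        return self.access_log[-1] if self.access_log else None

# Usage: catalogue a dataset, classify it, tag a model, audit an access
entry = CatalogEntry(dataset_id="cdr_2024", source_system="billing_db",
                     line_of_business="Telecom")
entry.model_tags.append("churn_model_v2")
entry.record_access(user="analyst_01", version="v2")
print(entry.last_access().user)  # -> analyst_01
```

In a real engine these records would be persisted and queried by the policy layer; the dataclass form is just the smallest shape that keeps all four pillars on one entry.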

The DCQI Framework enables ‘write-back’ capability and hence lays the foundation for predictive and prescriptive analytics.

In other words, it will act as a ‘Data Marketplace’ governed by a ‘Subscriber and Rules Policy Engine’ to create any application and pull or push any data inside a Data Lake.

Article written by Akshey Gupta