Four Common Mistakes That Can Make For A Toxic Data Lake

Originally Published on Forbes

Data lakes are becoming a popular way to get started with big data. Simply put, a data lake is a central location where all applications that generate or consume data go to get raw data in its native form. This speeds up the development of both transactional and analytical applications, because the developer has one standard location and interface for writing the data an application generates and for reading the data it needs.

However, left unchecked, data lakes can quickly become toxic: they keep costing money to maintain while the value they deliver shrinks or never materializes. Here are four common mistakes that can make your data lake toxic.

Your big data strategy ends at the data lake.

A common mistake is to treat the data lake as the whole of the big data strategy. It is a tempting choice because building a data lake is a deterministic project that IT can plan for and deliver given a budget. However, the assumption that “if you build it, they will come” is not correct. It is a mistake to simply hope that existing and future applications and systems will fill the data lake with data, and that data-driven applications will then consume it.

Enterprises need to ensure that the data lake is part of an overall big data strategy in which application developers are trained and mandated to use the data lake as part of their application design and development process. In addition, applications (existing, in development or planned) need to be audited for their data needs, and their use of the data lake needs to be planned for and incorporated into their design.

Enterprises also need to ensure that their business strategy is bound to the data lake and vice versa. Without this, a data lake is likely to be reduced to an IT project that never lives up to its potential of generating incremental business value.

In addition, enterprises need to ensure that the organization does not use the data lake as a dumping ground. Data that enters the data lake should be of high quality and generated in a form that makes it easy to understand and consume in data-driven or analytic applications. Data that is generated without any thought given to how it will be consumed often ends up dirty and unusable.

The data in your data lake is abstruse.

If attention is not paid to it, data in a data lake can easily become hard to discover, search or track. Enterprises that fill the data lake without thinking through what it means to discover and use data will simply end up with data sets that are either unusable or untrustworthy.

The best way to keep data in the data lake usable is to capture, alongside the data, metadata that describes it. This includes the lineage of the data (how and where it was created), its expected and acceptable schema, its types, how often the data set is refreshed, and so on. In addition, each data set should have an owner (application, system or entity), a categorization, tags, access controls and, if possible, the ability to preview a sample. Organizing metadata this way lets application developers and data scientists understand the data source and use it correctly in their applications.
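As a rough illustration, the metadata described above could be kept as a small catalog record stored next to each data set. This is only a sketch: the class name, the field names and the sample values are assumptions made for the example, not a standard or a specific product's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry kept alongside a data set."""
    name: str
    owner: str                    # application, system or entity responsible
    created_by: str               # lineage: how the data was created
    source_location: str          # lineage: where the data was created
    schema: Dict[str, str]        # column name -> expected type
    refresh_interval_hours: int   # how often the data set is refreshed
    category: str = "uncategorized"
    tags: List[str] = field(default_factory=list)
    allowed_readers: List[str] = field(default_factory=list)  # access control

    def missing_fields(self) -> List[str]:
        """Return the descriptive fields that were left empty."""
        required = {
            "owner": self.owner,
            "created_by": self.created_by,
            "source_location": self.source_location,
        }
        missing = [key for key, value in required.items() if not value]
        if not self.schema:
            missing.append("schema")
        return missing


# Illustrative entry with one lineage field deliberately left blank.
clicks = DatasetMetadata(
    name="web_clicks",
    owner="storefront-app",
    created_by="clickstream-collector",
    source_location="",
    schema={"user_id": "string", "url": "string", "ts": "timestamp"},
    refresh_interval_hours=1,
    tags=["web", "behavioral"],
)
print(clicks.missing_fields())  # ['source_location']
```

A check like `missing_fields` could run whenever a data set is registered, so that incomplete entries are caught before anyone tries to consume the data.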

All data sets in your data lake should have an associated “good” definition. For example, every data set should have a definition of an acceptable data record including the data generation frequency, acceptable record breadth, expected volume per record and per time interval, expected and acceptable ranges for specific columnar values, any sampling or obfuscations applied and if possible, the acceptable use of the data.
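One way to make a “good” definition concrete is to express it as data and check incoming records against it. The sketch below assumes a hypothetical `GoodDefinition` class with required columns, inclusive value ranges and a minimum record count per interval; a real definition would likely carry more of the properties listed above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class GoodDefinition:
    """Hypothetical description of an acceptable record and batch."""
    required_columns: List[str]
    ranges: Dict[str, Tuple[float, float]]  # column -> (min, max), inclusive
    min_records_per_interval: int

    def check_record(self, record: Dict[str, Any]) -> List[str]:
        """Return a list of problems found in a single record."""
        problems = []
        for col in self.required_columns:
            if record.get(col) is None:
                problems.append(f"missing {col}")
        for col, (lo, hi) in self.ranges.items():
            value = record.get(col)
            if value is not None and not (lo <= value <= hi):
                problems.append(f"{col}={value} outside [{lo}, {hi}]")
        return problems

    def check_volume(self, records: List[Dict[str, Any]]) -> bool:
        """True if the batch for one time interval is large enough."""
        return len(records) >= self.min_records_per_interval


definition = GoodDefinition(
    required_columns=["user_id", "latency_ms"],
    ranges={"latency_ms": (0, 60000)},
    min_records_per_interval=2,
)

good = {"user_id": "u1", "latency_ms": 120}
bad = {"latency_ms": -5}

print(definition.check_record(good))  # []
print(definition.check_record(bad))
# ['missing user_id', 'latency_ms=-5 outside [0, 60000]']
print(definition.check_volume([good]))  # False
```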

The data in your data lake is entity-unaware.

Often, when data is generated, little care is taken to record the entities that took part in an event. For example, the identifiers for the user, the service, the partner, etc. that came together in an event might not be recorded. This can severely restrict the use cases that can be built on top of the data set. It is much easier to record these identifiers up front and aggregate or obfuscate them later than to recover identifiers that were never captured.
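The sketch below illustrates why recording entity identifiers is the safer default: once they are in the event, aggregating or obfuscating them is a small downstream step. The `Event` fields, the `pseudonymize` helper and the salt value are all hypothetical, and a truncated salted hash is shown only as a simple form of obfuscation, not as a complete privacy solution.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """Hypothetical event that records every entity involved."""
    event_type: str
    user_id: str
    service_id: str
    partner_id: str


def pseudonymize(event: Event, salt: str) -> Event:
    """Return a copy with user and partner identifiers replaced by salted hashes."""
    def mask(value: str) -> str:
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        return digest[:16]  # shortened for readability in this example
    return replace(
        event,
        user_id=mask(event.user_id),
        partner_id=mask(event.partner_id),
    )


def events_per_service(events: Iterable[Event]) -> Counter:
    """Aggregate events by the service that handled them."""
    return Counter(e.service_id for e in events)


raw = [
    Event("purchase", "user-42", "checkout", "partner-7"),
    Event("purchase", "user-43", "checkout", "partner-7"),
    Event("search", "user-42", "catalog", "partner-9"),
]
masked = [pseudonymize(e, salt="demo-salt") for e in raw]

print(events_per_service(masked))  # Counter({'checkout': 2, 'catalog': 1})
```

Because the same salt maps the same identifier to the same masked value, per-user or per-partner aggregates can still be computed on the obfuscated events.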

Similarly, data that is not generated and stored at the highest possible level of granularity risks having its applicability and value diminished. This often happens when less data is preferred because of storage or compute concerns. It can also happen when the logging of data is not asynchronous, i.e. when logging impacts the transaction processing of the system and is therefore kept to a minimum.
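A minimal sketch of asynchronous logging, using the queue-based handlers in Python's standard `logging` module: the transaction path only puts the record on an in-memory queue, and a background thread writes it out. The `ListHandler` sink and the `handle_transaction` function are stand-ins invented for this example; in practice the sink would forward records to the data lake.

```python
import logging
import logging.handlers
import queue

log_queue: queue.Queue = queue.Queue()
records = []  # stand-in for the real data lake sink


class ListHandler(logging.Handler):
    """Hypothetical sink that collects log messages in a list."""
    def emit(self, record: logging.LogRecord) -> None:
        records.append(record.getMessage())


# The listener drains the queue on a background thread.
listener = logging.handlers.QueueListener(log_queue, ListHandler())
listener.start()

event_logger = logging.getLogger("events")
event_logger.setLevel(logging.INFO)
event_logger.propagate = False
# The handler used on the transaction path only enqueues the record.
event_logger.addHandler(logging.handlers.QueueHandler(log_queue))


def handle_transaction(order_id: int) -> str:
    """Process an order; logging does not wait on the slow sink."""
    event_logger.info("order_processed id=%s", order_id)
    return f"order {order_id} ok"


result = handle_transaction(7)
listener.stop()  # processes anything still queued, then joins the thread
print(result)   # order 7 ok
print(records)  # ['order_processed id=7']
```

With the write moved off the transaction path, there is less pressure to drop fields or sample events just to protect response times.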

The data in your data lake is not auditable.

A data lake should track how its data is being used. At any point in time it should be able to report the users that access the data, the processes that use or enhance it, the redundant copies of the data and how they came about, and the derivations of each data set. Data lakes that cannot do this quickly become a nightmare to maintain, upgrade and adapt. Without such auditability, enterprises end up stuck with large data sets that consume disk and increase the time it takes to process data records, while raising the probability that data is misused or misinterpreted.
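To make the idea of auditability concrete, here is a small sketch of an in-memory audit log that can answer two of the questions above: who touched a data set, and which copies or derivations descend from it. The `AuditEvent` and `AuditLog` names, the action labels and the sample actors are assumptions for illustration; a production system would persist these events.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical record of one interaction with a data set."""
    dataset: str
    actor: str                    # user or process touching the data
    action: str                   # "read", "write", "copy" or "derive"
    source: Optional[str] = None  # for copies and derivations: the origin


class AuditLog:
    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, event: AuditEvent) -> None:
        self._events.append(event)

    def accessors(self, dataset: str) -> Set[str]:
        """Actors that read, wrote, copied or derived from the data set."""
        return {
            e.actor for e in self._events
            if e.dataset == dataset or e.source == dataset
        }

    def descendants(self, dataset: str) -> Set[str]:
        """Copies and derivations of the data set, followed transitively."""
        found: Set[str] = set()
        frontier = [dataset]
        while frontier:
            current = frontier.pop()
            for e in self._events:
                if e.source == current and e.dataset not in found:
                    found.add(e.dataset)
                    frontier.append(e.dataset)
        return found


log = AuditLog()
log.record(AuditEvent("orders", "checkout-service", "write"))
log.record(AuditEvent("orders", "analyst-amy", "read"))
log.record(AuditEvent("orders_copy", "etl-job", "copy", source="orders"))
log.record(AuditEvent("daily_revenue", "report-job", "derive", source="orders_copy"))

print(sorted(log.accessors("orders")))
# ['analyst-amy', 'checkout-service', 'etl-job']
print(sorted(log.descendants("orders")))
# ['daily_revenue', 'orders_copy']
```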

In addition, if the data lake does not offer services that make it easier for consumers to decide on and actually use the data, the value delivered by the data lake can be severely restricted. Enterprises should consider building and maintaining application directories that track the applications that contribute to and read from each data set, and an index of data sets organized by categories, tags, sources, applications, etc., with the ability to quickly surface related data sets and data sets with parent-child relationships.
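An index of this kind can start out very simple. The sketch below assumes a hypothetical `DatasetIndex` that treats two data sets as related when they share at least one tag and that stores a single optional parent per data set; real catalogs would typically also index categories, sources and applications.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


class DatasetIndex:
    """Hypothetical tag-based index with parent-child links."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = {}
        self._by_tag: Dict[str, Set[str]] = defaultdict(set)
        self._parent: Dict[str, Optional[str]] = {}

    def add(self, name: str, tags: Iterable[str],
            parent: Optional[str] = None) -> None:
        tag_set = set(tags)
        self._tags[name] = tag_set
        self._parent[name] = parent
        for tag in tag_set:
            self._by_tag[tag].add(name)

    def related(self, name: str) -> List[str]:
        """Other data sets that share at least one tag with `name`."""
        hits: Set[str] = set()
        for tag in self._tags[name]:
            hits |= self._by_tag[tag]
        hits.discard(name)
        return sorted(hits)

    def children(self, name: str) -> List[str]:
        """Data sets registered with `name` as their parent."""
        return sorted(n for n, p in self._parent.items() if p == name)


index = DatasetIndex()
index.add("orders", ["sales", "transactions"])
index.add("daily_revenue", ["sales", "finance"], parent="orders")
index.add("refunds", ["transactions"], parent="orders")
index.add("web_clicks", ["behavioral"])

print(index.related("orders"))      # ['daily_revenue', 'refunds']
print(index.children("orders"))     # ['daily_revenue', 'refunds']
print(index.related("web_clicks"))  # []
```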

As the volume of data, the number of disparate data sets and the number of consumers that interact with and affect these data sets all grow, enterprises will increasingly face a data lake management nightmare and will be forced to set aside more IT resources to track and maintain their data lakes. Some simple guidelines and best practices on how data is generated, stored, cataloged and used can ensure that the data lake does not become toxic and delivers the value that was the reason for its creation in the first place.