DataHub to Kickstart Spring (data) Cleaning

This post, part of our DataHub Blog Community Contributor Program, is written by Lisa Stephens, DataHub Community Member and Data Scientist at Uken Games.

In the physical realm, there’s a point where a collection of objects simply becomes clutter. The same is often true in the virtual realm as well. In an ideal world, all of our data would fit neatly into a handful of clean, intuitive, highly extensible data models that are suitable for all use cases. In the real world, data engineering and data science teams often need to compromise to support teams trying to make informed decisions on the clock. We pull ad hoc data sets for one-off analyses, spin up short-term, scrappy data pipelines while more robust automation is in development, and more. We build up our own version of clutter.

Eventually, this creates a layer of technical debt — confusion about which data sources should be considered the authority when calculating a particular metric, questions about the meaning of fields that can only be answered by parsing the codebase or tracking down the individuals with the right institutional knowledge, and a lack of discoverability in an overwhelming data lake — that dissuades people from self-serve analytics.

At Uken Games, this is where DataHub comes in.

Who Are We?

Uken Games is a Toronto-based game studio founded in 2009 that develops titles including Solitaire Story: Ava’s Manor, Who Wants to Be a Millionaire? and Jeopardy! World Tour. We use data to deliver a more enjoyable experience for all of our players by diving deep into the player experience surrounding the in-game economy, seasonal events, and overall game balance.

Our DataHub setup is run entirely on ECS Fargate Spot tasks, amounting to less than $400/year in infrastructure costs.

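For anyone curious about that setup, pinning a service to Fargate Spot comes down to a capacity provider strategy. Below is a minimal sketch using boto3; the cluster, task definition, subnet, and region names are placeholders, and this is not our exact deployment script.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # region is a placeholder

# Assumes the cluster already has the FARGATE_SPOT capacity provider attached.
# All names below are hypothetical -- substitute your own resources.
response = ecs.create_service(
    cluster="datahub-cluster",
    serviceName="datahub-gms",
    taskDefinition="datahub-gms:1",
    desiredCount=1,
    # Run the service entirely on Fargate Spot to keep costs low.
    capacityProviderStrategy=[{"capacityProvider": "FARGATE_SPOT", "weight": 1}],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
            "assignPublicIp": "ENABLED",
        }
    },
)
print(response["service"]["serviceArn"])
```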

Leveling Up Colour Coding: Filling out the Glossary

There’s no shortage of IDEs or BI tools with the ability to display and search the schema of an entire database; however, finding a relevant-looking field name is only half of the underlying story. When a data consumer is evaluating whether or not these search results are relevant to their needs, there are a number of additional questions they might have:

  • How clean is this data? Does it contain duplicates or nulls?
  • What is the granularity of this data? Does it contain the required level of detail for the task at hand?
  • What time range does this data cover? Is it suitable for historical analysis?
  • Does this field account for X? Descriptive naming is harder than it looks — something that is obvious to one person may be ambiguous to another.

These are not new problems, and plenty of alternate strategies exist for tackling them. Database tables can be organised according to a precious metals system, where bronze indicates raw, unprocessed data and gold indicates fully cleaned, transformed data. Many of the most popular BI tools are built on the core concept of a semantic layer, where domain experts abstract away the required underlying knowledge of how tables should be joined together. Data dictionaries get created every day — they live in manually maintained spreadsheets, in-house wikis, and automated tools.

However, these strategies have the most value when they all live in one place. I need to know all of these things in combination to decide if a particular data source is useful for my needs — if something is gold quality but doesn’t contain the level of detail I require, I can’t actually use it. At Uken, we use DataHub’s Glossary Terms feature for this. The metadata for each table contains a term describing its:

  • Quality (bronze / silver / gold / iced)
  • Retention period (rolling date range, if applicable)
  • Granularity (row-per-X)
  • Data source (internal / third party)

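For illustration, here’s a minimal sketch of how terms like these can be attached to a table with DataHub’s Python REST emitter; the term names, dataset, and server address below are placeholders rather than our exact production setup.

```python
import time

from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

# Placeholder dataset and server address.
dataset_urn = make_dataset_urn(platform="redshift", name="analytics.daily_economy", env="PROD")
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Attach quality / granularity / source terms to the table (term names are examples).
terms = GlossaryTermsClass(
    terms=[
        GlossaryTermAssociationClass(urn=make_term_urn("Quality.Gold")),
        GlossaryTermAssociationClass(urn=make_term_urn("Granularity.RowPerUserDay")),
        GlossaryTermAssociationClass(urn=make_term_urn("Source.Internal")),
    ],
    auditStamp=AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub"),
)

emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms))
```

Note that emitting the glossaryTerms aspect this way replaces any terms already on the table, so in practice you may want to fetch the existing aspect first and append to it.
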
This approach offers a few benefits:

  • These terms can be used to filter search results. This makes it easier for someone to come in and answer a question like “where can I find a gold-quality user-day dataset suitable for constructing a quarterly report about the in-game economy?”
  • Glossary terms have an internal hierarchy: quality-related terms can be grouped together under the same term group. This creates a link between silver and gold glossary terms that can help end users find related concepts.
  • Each glossary term generates its own documentation page. This allows us to go beyond the naming conventions we decided on when setting up the hierarchy and provide additional detail, and helps us answer questions like, “Should a table storing the predictions of an ML model be considered silver or gold?”

Decluttering: Does This Spark Joy?

Garbage in, garbage out.

It’s a data idiom as old as the Unix epoch. There are many reasons that trash data comes to be, most of which are better suited for a detailed discussion on the contract that should exist between data producers and data consumers, rather than a light-hearted blog post about data governance thinly veiled as a spring cleaning analogy.

That said, one of the underappreciated reasons in the mix is that not all trash data started out that way.

Maybe it was never intended for use outside of its domain context. Maybe it was useful for understanding the business landscape at a particular point in time, but market conditions have changed since. It was useful to someone, for something, at some time.

Regardless of the circumstances of its creation, there comes a time in every data set’s lifespan when it needs to be evaluated for relevancy. Otherwise, the benefits of improving data discoverability become a double-edged sword. People can find data they shouldn’t be using and draw inaccurate or flawed conclusions.

At Uken, we use DataHub’s Statistics feature to do this. It gives us insight into which tables are being used, how much, by whom, and for what. When we recently used this feature to evaluate a family of tables that had historically been used by our data science and product management teams, we found that as many as 40% of them — amounting to approximately 100 million rows — were no longer needed. This was significantly less visible in the aggregate queries-per-day metric we’d been reporting on data monitoring dashboards. When we looked into the details, we found that the only recurring queries came from data pipelines making routine updates and BI tools refreshing schemas. These tables had been useful once — data sets uploaded for ease of access where the analysis had long since been concluded and actioned, prototypes of a data pipeline that had since changed ownership — but weren’t anymore.

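For a sense of what’s available programmatically, those usage aggregations can also be pulled out of DataHub’s GraphQL API. The sketch below is illustrative only; the server address, token, dataset URN, and exact field names are placeholders and may need adjusting to match your DataHub version’s GraphQL schema.

```python
import requests

# Placeholders: substitute your GMS address, access token, and dataset URN.
GMS = "http://localhost:8080"
TOKEN = "<personal-access-token>"
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:redshift,analytics.daily_economy,PROD)"

# Approximate query shape; verify field names against your instance's GraphQL schema.
query = """
query usage($urn: String!) {
  dataset(urn: $urn) {
    usageStats(range: MONTH) {
      aggregations {
        totalSqlQueries
        uniqueUserCount
        users {
          user { urn }
          count
        }
      }
    }
  }
}
"""

resp = requests.post(
    f"{GMS}/api/graphql",
    json={"query": query, "variables": {"urn": DATASET_URN}},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
aggregations = resp.json()["data"]["dataset"]["usageStats"]["aggregations"]
print(aggregations["totalSqlQueries"], "queries in the last month")
```
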
Storage isn’t expensive in this day and age, but it isn’t free either; that’s a lot that can be cleaned up.

Everything Has A Place: Finding Context

Adam Savage of Mythbusters fame once said, “The only difference between screwing around and science is writing it down.” In other words: documentation is king. Documentation is where the problem space was summarized, the specs were written, the exploratory analysis was shared and discussed, and more.

It would be a Herculean task to try to migrate all of this existing documentation into a tool like DataHub, with challenges including the loss of comment and revision history, the need for anyone writing documentation to learn Markdown, and more. But migration isn’t what’s needed to provide the missing context — a link is.

At Uken, we make use of DataHub’s entity-level About section to do this. Depending on the nature of the table in question, we include:

  • A brief summary of what the table contains.
  • If the table comes in variants of different cleanliness or granularity, references to what those variants are and a summary of how they differ.
  • If the table has been deprecated, a reference to the newer version that should be used.
  • Hyperlinks out to the original spec, analysis, or monitoring dashboard.
  • Hyperlinks out to the corresponding ETL job.

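As a rough sketch of how that kind of description and those hyperlinks can be pushed programmatically (the dataset, URLs, and wording below are invented for illustration), the datasetProperties and institutionalMemory aspects can be emitted from Python:

```python
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    DatasetPropertiesClass,
    InstitutionalMemoryClass,
    InstitutionalMemoryMetadataClass,
)

dataset_urn = make_dataset_urn(platform="redshift", name="analytics.daily_economy", env="PROD")
now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub")

# Table-level summary text (placeholder wording).
properties = DatasetPropertiesClass(
    description="One row per player per day summarizing in-game economy activity. "
    "See the linked spec and ETL job for full context.",
)

# Hyperlinks out to the original spec and the ETL job (placeholder URLs).
links = InstitutionalMemoryClass(
    elements=[
        InstitutionalMemoryMetadataClass(
            url="https://wiki.example.com/specs/daily-economy",
            description="Original spec",
            createStamp=now,
        ),
        InstitutionalMemoryMetadataClass(
            url="https://airflow.example.com/dags/daily_economy",
            description="ETL job",
            createStamp=now,
        ),
    ]
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
for aspect in (properties, links):
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))
```

Descriptions edited through the DataHub UI are stored in a separate editable aspect, so a pipeline-emitted description like this won’t clobber hand-written documentation.
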
That covers us at the table level, but what about the field level? Setting aside falsehoods programmers believe about time, timestamps remain an incredibly complex thing in the world of mobile games. A business question like “when did this event happen?” is simple on the surface, but when translating it into a data question we have to wrangle with follow-ups including:

  • Happen in what sense? The time that the player’s phone or tablet measured (device timestamp), the time that we were notified about it (server received timestamp), or the time that the data was processed (batch timestamp)? These three things can be very different for both legitimate and non-legitimate reasons, including:
      ◦ The player’s device time is not derived from (or similar to) the network they are connected to, but has been set manually to a past or future date.
      ◦ The player was offline for part or all of the gameplay, and the logs were only sent when they reconnected to the internet.
      ◦ There was a code change in the transformation pipeline, and the data was reprocessed to account for it.
  • When by whose clock? While Uken is a Toronto-based organization with the majority of employees operating in EST/EDT, we have a global player base. One day, as summarized in a morning report to stakeholders, might span several days as perceived by the player.

That’s a lot of context to try to fit into a descriptive field name, and Redshift caps column names at 127 bytes anyway.

At the field level, we use a two-pronged approach for this:

  • If the field is unique, we’ll include a text-based description that provides this context and preemptively addresses frequently-asked-questions.
  • If the field is common to several tables, we’ll point viewers to a shared glossary term. This centralizes the description, avoids repetition, and creates a link between two fields with (potentially) different names but the same meaning.

For our time-based problem above, that involves tagging timestamp-related fields with one of the “device_timestamp”, “server_received_timestamp”, or “processed_timestamp” terms.

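As a final sketch (again with placeholder dataset, column, and server names), attaching one of those shared terms and a short description to a specific column can be done through the editable schema metadata aspect:

```python
import time

from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    EditableSchemaFieldInfoClass,
    EditableSchemaMetadataClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

dataset_urn = make_dataset_urn(platform="redshift", name="analytics.event_log", env="PROD")
audit = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub")

# Tag the "created_at" column (hypothetical name) as a device timestamp.
field_info = EditableSchemaFieldInfoClass(
    fieldPath="created_at",
    description="Time measured by the player's device; may differ from server time.",
    glossaryTerms=GlossaryTermsClass(
        terms=[GlossaryTermAssociationClass(urn=make_term_urn("device_timestamp"))],
        auditStamp=audit,
    ),
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=EditableSchemaMetadataClass(
            editableSchemaFieldInfo=[field_info],
            created=audit,
        ),
    )
)
```
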
Summary

Using DataHub to kickstart spring (data) cleaning has delivered a lot of value for Uken so far, with the main highlights being:

  • Using Glossary Terms to tag each data set with a consistent set of attributes that data consumers can filter search results on, speeding up time-to-analysis.
  • Using Query Statistics to identify low-usage data sets that have not been maintained and should be targeted for eventual cleanup, reducing the risk of inaccurate analysis and reducing processing costs.
  • Using the entity-level About section to link out to existing documentation, allowing end users to more easily find the missing context they’re looking for.

That said, we’re still at the beginning of our Data Governance journey. There are places where this workflow is quite manual and ripe for process improvements:

  • Embedding some of this metadata in our existing ETL would allow it to be ingested automatically, rather than added after the fact.
  • Data entities and structures change over time — keeping a data dictionary relevant requires diligent, consistent maintenance. Tools for easily finding new fields/tables that are missing metadata would simplify this maintenance.
  • Cross-linking variants of the same data at different processing stages could be done through DataHub’s Lineage feature.

We’re as eager as anyone in the DataHub community to learn about best practices and alternate strategies, and excited to see what the future holds for this tool.

Interested in becoming a contributor to the DataHub Blog?

Inspire others and spark meaningful conversations. The DataHub Community is a one-of-a-kind group of data practitioners who are passionate about enabling data discovery, data observability, and federated data governance. We all have so much to learn from one another as we collectively address modern metadata management and data governance; by sharing your perspective and lived experiences, we can create a living repository of lessons learned to propel our Community toward success.

Check out more details on how to become a DataHub Blog Contributor; we can’t wait to speak with you! 👋
