Managing PII in DataHub: A Practitioner’s Guide
PII and its importance
Every day, the world produces roughly 2.5 exabytes of data (Source: Accenture). That’s equivalent to 2.5 quintillion bytes, 2.5 billion gigabytes, or roughly 145 million digital copies of James Cameron’s Avatar, my favorite movie.
Among this data often lies PII, or personally identifiable information. NIST defines PII as “any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.” In other words, PII is any piece of information that could be used to infer someone’s identity.
Recent data points to an unfortunate increase in PII breaches at the companies we interact with daily: for example, the breach at Microsoft less than a year ago, and an even more recent one at Okta.

It goes without saying that PII is incredibly important and valuable to a company, and an emphasis should be placed on ensuring this data is properly managed so that it can be adequately protected. In this article we will walk through how to annotate your datasets that contain PII, and then describe some of the powerful use cases DataHub offers once annotation is complete.
How to annotate data as containing “PII” in DataHub
Step 1: Build and ingest your business glossary
To begin annotating datasets that contain PII, we first need to create and ingest a glossary of terms covering the various types of personally identifiable information your business or organization collects. Because this varies from organization to organization, each glossary will differ. However, we have prepared a utilitarian glossary containing many different types of PII and their “levels” of sensitivity/impact according to NIST standards to get you started: all you have to do is upload it to your DataHub instance!

Examples of some of the PII terms we’ve included.
How does this glossary file work, you ask?
In DataHub, it is possible to create multiple glossary term “sets” which can interact with one another. In our case, this is helpful because it allows us to associate our PII term “set” with our Impact Levels term “set”, giving each PII glossary term an associated level of impact should it be accidentally disseminated. This enhances the experience for end users, as they are now able to filter by Low, Medium, or High impact, rather than having to filter by individual PII terms, of which there can be dozens. For more information please see here.

PII term set
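As a concrete sketch, here is what a minimal glossary file for DataHub’s business glossary source might look like. The term names, descriptions, and the use of `inherits` to link PII terms to impact levels are illustrative; check the glossary format documentation for your DataHub version before relying on the exact field names.

```yaml
# business_glossary.yml -- a minimal, illustrative PII glossary.
version: 1
source: DataHub
owners:
  users:
    - datahub
nodes:
  - name: Classification
    description: Impact levels for accidental disclosure, per NIST guidance.
    terms:
      - name: HighImpact
        description: Severe harm to the individual if disseminated.
      - name: MediumImpact
        description: Moderate harm to the individual if disseminated.
      - name: LowImpact
        description: Limited harm to the individual if disseminated.
  - name: PersonalInformation
    description: Types of PII collected by the organization.
    terms:
      - name: Email
        description: An individual's email address.
        inherits:
          - Classification.MediumImpact
      - name: SSN
        description: US Social Security Number.
        inherits:
          - Classification.HighImpact
```

A file like this can then be ingested with a recipe whose source type is `datahub-business-glossary`, pointing its `file` config at the YAML above.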
Furthermore, if down the line there is reason to move a PII term from one impact level to another (e.g., Email is reclassified from “Medium” to “High” impact), every dataset containing emails will automatically switch from “Medium” to “High” with a small edit to the .YAML file. Don’t take my word for it; try it yourself in your own instance.
Please feel free to use the above glossary as a starting point, and edit it to your liking. Our .YAML recipes are easy to use, and we recommend keeping your glossaries as checked-in artifacts in your version control tool of choice to track changes over time.
For more information on business glossaries in general, why they are important, and how to use them in DataHub, see a previous article here.
Step 2: Annotate the data, either automatically or manually
So you now have your business glossary ingested into your DataHub instance. Nice work. Now, in order for the glossary to be useful, we need to actually attach the PII term to datasets. This can be done in a few different ways:
“Shift-left” via annotations in schema languages (recommended)
The DataHub community has pioneered shift-left practices for annotating schemas in CI/CD pipelines with the right glossary terms. Zendesk demonstrated how they have done this with Protobuf schemas and Saxo Bank has shared their automated approach for applying glossary terms. As we speak, other companies in the DataHub community are adding support for Thrift and other schema languages. We recommend this “Shift-left” approach because it minimizes the time teams have to spend on manual data enrichment.
Using “Transformers” and pattern matching during metadata ingestion
In DataHub, you can create a transformer that automatically adds glossary terms like PII to datasets during metadata ingestion into your instance. For more information on how this looks, please see here.

Specify Regex patterns to determine which glossary terms to add
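As a sketch of the pattern-matching approach, the recipe fragment below attaches glossary terms to columns whose names match a regex during ingestion. The transformer type and config shape follow DataHub’s `pattern_add_dataset_schema_terms` transformer; the source config and term URNs are placeholders, and you should verify the exact schema against your DataHub version’s transformer docs.

```yaml
# Illustrative ingestion recipe fragment: tag columns as PII by
# matching column names against regex patterns during ingestion.
source:
  type: snowflake
  config:
    # ... your source config ...
transformers:
  - type: pattern_add_dataset_schema_terms
    config:
      term_pattern:
        rules:
          ".*email.*": ["urn:li:glossaryTerm:PersonalInformation.Email"]
          ".*ssn.*": ["urn:li:glossaryTerm:PersonalInformation.SSN"]
```

Because the rules live in the recipe, re-running ingestion keeps annotations in sync as new datasets with matching column names appear.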
Via the DataHub UI
The DataHub UI supports adding glossary terms to datasets as well as individual columns with a few clicks.
Acryl Data Auto-PII Detection
Acryl Data partners with vendors that provide machine-learning-based PII detection, with approval workflows that let humans verify proposed terms. For more information, please visit our website and fill out a form.
Coming Soon: CSV Ingestion of Associations
In the next DataHub release we plan on adding a new .CSV-based ingestion plugin that will let you attach PII terms to existing datasets and schemas programmatically. End users will be able to list which PII terms to associate with each dataset or column in a .CSV file, and our plugin will do the rest!
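Since this plugin has not shipped yet, the exact file format is still to be finalized, but a hypothetical input file might look something like this, with one row per dataset- or column-level association (the column names and URN formats here are illustrative only):

```csv
resource,subresource,glossary_terms
"urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.users,PROD)",email,urn:li:glossaryTerm:PersonalInformation.Email
"urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.users,PROD)",ssn,urn:li:glossaryTerm:PersonalInformation.SSN
```

Leaving `subresource` blank would apply the term at the dataset level rather than to an individual column.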
Common Use Cases in DataHub
Congratulations on making it this far. Your glossary term sets are ingested, and a large portion of your data is now annotated as containing PII. What next? Here are some use cases that make the work you did worth it!
Search and Discovery with PII
End users can now quickly answer questions they could not answer before, such as:
Where is all of the PII data residing in my data stack?
Which datasets contain emails in my data stack?
Is this dataset safe to send, or does it contain PII?
…and more!
Dataset Downloads
End users can now download .CSV files of their PII datasets in a couple of clicks from the DataHub UI. Watch the video.
API calls for access control to datasets
Now that a good portion of your datasets are properly annotated as containing PII, you can use DataHub’s friendly API to begin governing access control in your provider of choice.
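For example, an external access-control system could query DataHub for every dataset carrying a given PII glossary term and act on the results. The sketch below builds such a request against DataHub’s GraphQL search API; the endpoint path, the `glossaryTerms` filter field, and the response shape are assumptions based on DataHub’s GraphQL API and should be verified against your instance’s version.

```python
# Sketch: find all datasets tagged with a PII glossary term via
# DataHub's GraphQL search API, so an external system can gate access.
import json
import urllib.request

# Assumed GraphQL endpoint for a locally running DataHub instance.
DATAHUB_GRAPHQL = "http://localhost:8080/api/graphql"


def build_pii_search(term_urn: str, count: int = 100) -> dict:
    """Build a GraphQL search request for datasets carrying `term_urn`."""
    query = """
    query search($input: SearchInput!) {
      search(input: $input) {
        searchResults { entity { urn } }
      }
    }
    """
    return {
        "query": query,
        "variables": {
            "input": {
                "type": "DATASET",
                "query": "*",
                "start": 0,
                "count": count,
                # Filter on the glossary-term facet (field name assumed).
                "filters": [{"field": "glossaryTerms", "value": term_urn}],
            }
        },
    }


def fetch_pii_dataset_urns(term_urn: str, token: str) -> list:
    """POST the search and return matching dataset URNs.

    Requires a live DataHub instance and an access token; not run here.
    """
    payload = json.dumps(build_pii_search(term_urn)).encode()
    req = urllib.request.Request(
        DATAHUB_GRAPHQL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [r["entity"]["urn"]
            for r in body["data"]["search"]["searchResults"]]


if __name__ == "__main__":
    request = build_pii_search("urn:li:glossaryTerm:PersonalInformation.Email")
    print(json.dumps(request["variables"]["input"], indent=2))
```

The returned URNs can then be fed to whatever access-control provider you use, e.g. to deny exports of any dataset that carries a High-impact term.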
Metadata analytics sanity check
Navigate to the “Metadata Analytics” tab to validate the number of datasets you have annotated as PII, and glean additional insights into data ownership and more.

Conclusion
It is becoming increasingly important for data-gathering companies to keep track of PII to ensure their end users’ data is properly handled. I hope this article has served as a useful introduction to PII, demonstrated its importance, and shown you how to leverage DataHub to track it properly.
DataHub does a lot of other things, too. See here for more.
Have a success story using DataHub to manage PII? Write to me at feedback@acryl.io
Acryl Data is hiring, click here for more information.