Why do you need a data dictionary?
Simple data discovery starts with good organization and a plan to build out definition database. A data dictionary is a list of key terms and metrics with definitions; a business glossary. Although this seems like a simple exercise, it’s very difficult to align business departments with the same definitions. A ride-sharing company we wrote about in a previous article shared an example of the difficulties related to data definitions. At this company, it was very difficult to get aligned on the same metrics for “number of rides a week”. Why?
- The data team defines the “number of rides per week” as the total number of rides that were completed between Jan. 1, 2020, 12:00 AM → Jan. 7, 2020, 11:59 PM.
- The marketing team defines the “number of rides per week” as the total number of rides that were started between Jan. 1, 2020, 12:00 AM → Jan. 7, 2020, 11:59 PM.
- The sales team defines “number of rides per week” as the total number of riders that paid for a ride Jan. 1, 2020, 7:00 AM → Jan. 8, 2020, 6:59 AM
All data-driven organizations experience this problem as they begin to grow their data and people. And although it sounds like a simple problem, which might require a meeting to solve, aligning the business and data to remove confusion can be an extremely profound problem. That's why a data dictionary can be one of the most valuable tools that a data team can create to deliver results.
Below are the steps and data dictionary best practice that teams need to take when creating a data dictionary:
What is a data dictionary?
A data dictionary is a living document that helps explain the context and meaning of your organization's data. Business analysts and domain experts should work together to create and maintain it throughout the lifecycle of the project.
1. Gather terms from different departments
The first step data teams need to take is to collect the different terms that all departments are using to define key business definitions. Today, this collection could be done in a spreadsheet with a list of the business concepts. One approach teams take is to look at a sample of the most used reports and dashboards by teams to get an understanding of the main metrics used across departments. This can be done easily using Secoda, which automatically indexes the most used resources by an organization. Once teams have a good understanding of the commonly used tables and dashboards, listing out all the primary axis by charts can help to understand the term as well as the definitions.
- For example, a dashboard showing the rides completed by state is composed of two primary terms: "rides completed" and "state". The teams should compile a list of these two terms.
- One way a tool like Secoda can help with this stage is to use the "tag" feature to categorize which reports involve the "rides completed" and "state" terms. Additionally, Secoda automatically documents the data type, and examples of the values, additional columns in the table, owners, and a link to the report.
Once you have your list of terms, you can begin grouping them by the functional unit. For example, the marketing metrics, product metrics, operations metrics. Another suggestion is to categorize the similar dimensions that are repeated throughout the tables (product_id, country, year, etc.) This is helpful because it some columns could have names that are difficult to understand. Making a list of these definitions can help data teams standardize some of these confusing terms. After you prepare this list, ask the teams to go over their section and add any terms that you might have missed. Once the terms have been collected, you should be ready to define the terms
2. Give the terms a definition
Data teams should try to define the terms collected based on the information that they can collect on the terms. To do this, the data team should pull any existing definitions from outdated documentation. Most teams keep this documentation in google sheets or confluence, which can make finding the definitions a tricky part of the process. Another way to find the definitions is by looking through annual reports, or from actual code such as SQL queries or Excel. Names of terms should always be written out and unabbreviated to make sure that they are universally understandable. Even non-technical staff should understand the definitions of rows or columns, even if they refer to some other definitions in the database.
After this initial draft is complete, schedule time to sit with the teams and ask them if there are inconsistencies, inaccuracies, or definitions that need to be refined. This part of the process can take time and can require multiple meetings to clear up. When our team started working on this step, we found that a lot of the meeting was spent explaining "why", as opposed to talking through solutions. One strategy that helped make these meetings more efficient was to create detailed documentation before the meeting. This way, all stakeholders came into the meeting prepared and informed about the discussion.
One additional tip is to make sure the stakeholder understands that the goal of the meeting is to find out how the metrics should be defined, not how it was defined in the past. If the current definition is not perfect, this is a great chance to correct it and align teams on the same definition. Remember that overly complex metrics are rarely the right solution. One rule we like to follow internally is to always remember Goodhart's law. This law states that when a measure becomes a target, it ceases to be a good measure. We don't want to create measurements that are too complex and incentivize people inappropriately.
3. Find alignment
In this step, it's important to look for similarities or dependencies in definitions across teams and resolve conflicts. Multiple teams may have similar definitions. For those teams and terms, bring them into the same meeting to align the teams towards one definition. When you're setting up the meeting, clarify the outcome of the meeting to get one standard definition that works for both teams.
- This might require one team to concede to another teams definition
- This might require both teams to agree to a new name for the definition.
- This might require both teams to rework their existing definition to find a completely new definition.
One suggestion is to make sure that names and definitions avoid and ambiguity. Use longer, more descriptive terms when they add more clarity. Your goal should be to eliminate any confusion about the definitions.
4. Get support and sign off
Once these meetings conclude and teams are on the same page, have the team leads sign off on the new definition. The team leads should be in alignment with the definitions and feel like the data team worked collaboratively with them to come to this definition. One effective strategy is to involve the leadership team in the exercise early to make sure that their team leads are signing off on the definitions. If the team leads see the value of having alignment, this can move at a much faster pace.
5. Centralize the document
After you receive sign-off from team leads, publish the data dictionary as a document that can be accessed by all employees. The definitions should be understood by anyone in the company, not just the data team. Additionally, the definitions should be adopted by all teams and by leadership. Because of this, getting people to start using the right definitions is a tricky part of the process. The individual definitions should be baked into the reports and visualizations whenever possible.
One of the tougher tasks is to manage the data dictionary after it's been published. One solution to this is to try to autogenerate the data dictionary definitions from a single source, such as a code repository. We believe that a tool that integrates into the database or code is a great way to manage the data dictionary.
6. Upkeep the data dictionary
Although the key metrics should be stable, they may need to change over time. One instance that might require key metrics to change is when a new revenue stream is introduced or when the pricing of an existing revenue line changes. These changes traditionally come from the business team and might require the data team to implement the changes into the data dictionary.
When there are big changes in the business, treat the changes to the data dictionary as a key part of the product release. Communicate the changes in the dictionary definition to the rest of the teams and make sure all team leads are informed about the new changes. One strategy that we used was creating a slack channel dedicated to updates to the data dictionary. This way, all business stakeholders can stay informed on the important key business definitions.
Creating a data dictionary is not a small undertaking, that's why we make it easier with Secoda. it requires patience and alignment with leadership. This process will likely take a few months, even with the proper tooling. That being said, having the proper tools in place to record the dictionary can make maintaining and accessing the dictionary a much easier process. We suggest tackling the process head-on and early on. Many companies wait a long time before defining their data dictionary, which can make the process much more difficult in the future. We also have a data dictionary template in our data dictionary template.
Desirable characteristics of a good data dictionary include timeliness, consistency, accuracy and completeness
- Timeliness is critical in this data-driven era. As the saying goes, “data is the new oil” and “data is the currency of marketing.” You must ensure that information in your data dictionary has been updated to reflect any changes made to your database.
- Consistency ensures you are using a single definition for each item recorded in the data dictionary. For example, if you have demographic data about your customers and residents, do you use the same label (or terminology) when referring to similar items? For instance, if you create a column for each customer's “employment status,” be sure all future usage of that term also refers to employment status and not income level or education level.
- Accuracy means that information in your data dictionary is an exact reflection. To be accurate, your database metadata must correspond with actual database content; it should contain no errors or inaccuracies caused by sloppy notation or cut-and-paste mistakes from one entry to another.
- Completeness means that information has been recorded thoroughly without omissions and coincides with relevant data captured elsewhere in the organization's databases. In other words, there should be no gap between entries in your data dictionary and what they represent within other databases.
The goal of a data dictionary is to enable effective use of your data assets
Whatever you call it—a data dictionary, a glossary or a metadata registry—the goal of this document is to enable effective use of your data assets by ensuring you have an accurate understanding of what they mean.
A good sql or database data dictionary will contain: A description of each business term that has been defined in the model. For example, is "customer" defined as a person who bought one product in the last 6 months or someone who made three purchases in the last year? The business definition should be documented along with other relevant attributes (such as date range).
Information about how specific terms are being used by different departments (and perhaps even different groups within those departments).
The technical details behind how new terms are generated and how they are originally populated into a database or software system.
Try Secoda for free
Teams that invest the time to get alignment on their data can see major benefits of data dictionary in the long term as they make faster decisions as a team. Our team is working towards a tool that helps teams align on key business metrics and access their data from a single dashboard. If you're interested in trying out the tool, you can sign up to try the tool for free here