How to Automate Data Documentation?
Let’s face it: one of the necessary but least exciting parts of working with data is documenting it. Whether you’re a data analyst responsible for creating the documentation or a member of the business intelligence team trying to navigate a database, the data documentation experience can be a difficult one.
At the same time, there’s no denying that formal data documentation is essential to any organization, especially one that’s scaling. Keeping this knowledge consistent and regularly updated is key to onboarding new members and ensuring high data quality. Unfortunately, many data teams find themselves inundated with requests and questions from stakeholders outside the data team. As a result, data documentation falls by the wayside, and as the volume of data grows, the data becomes more and more difficult to work with.
Automating data documentation is an obvious solution to a problem that nearly everyone working with data faces. It removes the manual work of maintaining the documentation and creates a consistent process for doing so, which helps keep data and insights reliable and trustworthy across the board. Teams that haven’t started automating their data documentation are missing out on significant savings in time and capacity, and on opportunities to improve data literacy.
In this post, we’ll cover:
- What data documentation is and why it’s important
- The current data documentation landscape and why it doesn’t work
- Benefits of automating data documentation
- How to automate data documentation
What is data documentation?
Data documentation is a description of anything in a company’s data knowledge: existing data, databases, warehouses, tables, and the resulting graphs, charts, metrics, queries, etc. It’s a broad term for the different ways that context can be provided on data. Both data producers and consumers should be able to understand the data documentation. For example, common data documentation might include the date the data was created, the source it came from, and how it’s structured. In other words, data documentation makes it easier to work with data and ensures that there’s a mutual understanding of how it’s organized. Despite this important function, investing time or effort in data documentation is not always a priority, because conflicting or higher-priority work keeps coming up.
What is the status quo of data documentation?
On one hand, data engineers and analysts are performing a balancing act: they are responsible for the data architecture on top of fielding requests from external stakeholders and maintaining the database itself. Data documentation, if it happens at all, is not a priority for many data teams, and if there is a process in place, it tends to be strenuous, manual, and isolated from the rest of their data workflow. As a result, the documentation is at risk of becoming outdated. The problems that arise from a lack of formal documentation only increase as the company scales and ingests more data while still housing historical data. And with the rise of remote and hybrid work, turning around and tapping your data team on the shoulder to ask a question is often no longer possible, so getting the context you need to understand the data is that much more difficult.
On the other hand, stakeholders and teams outside of the data organization, such as sales, marketing, and product, have difficulty navigating the data and databases because the documentation isn’t straightforward or easy to find. Perhaps it doesn’t exist at all, and these stakeholders need to ask someone on the data team directly about data insights. A common issue is duplicate documentation that contains conflicting information. For example, one source might say that revenue is measured using XYZ, while another says that revenue is measured using ABC. Again, as a company scales, the questions and hand-holding only become more plentiful, and likely redundant.
Currently, data documentation isn’t typically automated. It’s often fragmented and lives separately from the data workflow, for example in a Confluence document or even a Google Doc. The typical, scrappy solution is to copy and paste information from the data warehouse into the document. Unfortunately, data collection, ingestion, and changes happen faster than a human can copy and paste, so the resulting documentation quickly becomes inaccurate and out of date.
This lack of automation means that there’s no central source of truth. Perhaps one team is using a Confluence doc or a Google Sheet, while another is still referencing context provided on a Tableau dashboard that was last updated months ago. And without a process or automation, those who are responsible for data documentation (the analysts and engineers) have less and less incentive to maintain it, thinking there’s no point since it’ll be out of date soon enough anyway.
How does automating data documentation solve this?
The benefits of automated data documentation include:
- Removing the manual work of updating existing documentation
  - Documenting data only becomes more difficult as the volume of data increases, which makes manual documentation a time-consuming and strenuous process. By automating documentation, you can free up your data team’s capacity to focus on bigger-picture projects and rest easy knowing that as your data changes, so does your documentation.
- Standardizing documentation to increase trust in data
  - Not only can you trust that your data documentation is being updated as your data is, but you’ll also be able to rely on certain pieces of information being recorded across the board. This reduces the mental load of figuring out what should be documented, and makes it easier for people outside of your data team to understand the documentation.
- Scaling as the data and company scale
  - Your data documentation process should grow with you and account for that growth. When documentation is automated, there is no need for a massive overhaul or constant iterations on the documentation. You can trust that your documentation grows alongside you and your data.
- Empowering data organizations to become proactive instead of reactive
  - A good automated data documentation tool will start making suggestions based on your current setup and the context you already provide to readers of the documentation. For example, if a new table is created that has similar columns or descriptors to an existing one, the tool might ask if you’d like to put it in the same category as the existing table.
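The kind of suggestion described in the last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular tool works: the function names (`jaccard`, `suggest_category`), the table names, the categories, and the 0.5 threshold are all hypothetical. It compares the column names of a new table with those of already categorized tables and proposes the category of the closest match.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of column names two tables have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_category(new_columns, existing, threshold=0.5):
    """Return (category, matched_table, score) for the most similar table,
    or None if no existing table is similar enough.

    `existing` maps table name -> {"category": str, "columns": set of names}.
    The threshold is an arbitrary, hypothetical cut-off.
    """
    best_name, best_score = None, 0.0
    for name, info in existing.items():
        score = jaccard(set(new_columns), info["columns"])
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return existing[best_name]["category"], best_name, best_score
    return None


# Hypothetical catalog of tables that have already been categorized.
existing_tables = {
    "customers": {
        "category": "Customer data",
        "columns": {"id", "email", "phone", "name"},
    },
    "page_views": {
        "category": "Web analytics",
        "columns": {"id", "url", "viewed_at"},
    },
}

suggestion = suggest_category(
    ["id", "email", "phone", "signup_date"], existing_tables
)
no_suggestion = suggest_category(["sku", "price"], existing_tables)
```

In this sketch a human still confirms or rejects the suggestion; the code only narrows down which existing table the new one most resembles.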
How to Automate Data Documentation with Secoda
Secoda makes the daunting task of automating data documentation easy. The following features take the manual work out of data documentation and require little onboarding time:
Using metadata to create documentation automatically
Secoda uses metadata to automatically record things like:
- When the table or database was last documented
- Who it was documented by
- Queries run against that table
- Table descriptions
- Column descriptions
- Column level lineage
- Table level lineage
- Column profiling
- Metrics: queries that are built into the metrics layer are documented automatically
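To make the idea of metadata-driven documentation concrete, here is a minimal sketch in Python. It is not Secoda’s implementation: it assumes an in-memory SQLite database as a stand-in for a warehouse, and the helper names (`document_tables`, `render`) and the `orders` table are hypothetical. The point is that table and column listings can be read from the database’s own catalog instead of being copied by hand.

```python
import sqlite3


def document_tables(conn: sqlite3.Connection) -> dict[str, list[dict[str, str]]]:
    """Read table and column metadata from SQLite's catalog.

    Returns {table_name: [{"column": name, "type": declared_type}, ...]}.
    """
    docs: dict[str, list[dict[str, str]]] = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        docs[table] = [{"column": col[1], "type": col[2]} for col in columns]
    return docs


def render(docs: dict[str, list[dict[str, str]]]) -> str:
    """Turn the collected metadata into a plain-text documentation page."""
    lines = []
    for table, columns in docs.items():
        lines.append(f"Table: {table}")
        for col in columns:
            lines.append(f"  - {col['column']} ({col['type']})")
    return "\n".join(lines)


# Hypothetical example table standing in for real warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_email TEXT, amount REAL)"
)
docs = document_tables(conn)
page = render(docs)
print(page)
```

Because the page is regenerated from the catalog each time the script runs, a schema change shows up in the documentation on the next run rather than waiting for someone to notice it.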
Auto documentation of users and their interactions with the data
- Every Secoda user gets a profile, and within each profile, the Secoda workspace owners are able to see which team members are interacting with which resources. This helps them understand which resources are irrelevant or stale.
- Auto documentation of user interactions also makes triaging and support easier. For example, if someone outside of the data team is struggling or has a question, the data team can simply pop into their profile to see which table or database they might be having issues with.
Smart suggestions
- Data tags and table descriptions on existing data are taken into account, and when new, similarly structured data is added to the Secoda workspace, users are prompted to tag or add context to it based on the existing data. For example, if a table is tagged as containing PII (personally identifiable information), data added to another table that shares characteristics with the PII-tagged data may be flagged to the workspace owner as potentially PII.
- The “magic wand”: this feature flags similar and downstream columns, allowing you to propagate the description of one column to all the columns that match its name.
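The name-matching idea behind that last feature can be sketched as follows. This is a simplified illustration, not Secoda’s actual logic: the function name `propagate_descriptions`, the catalog structure, and the sample columns are assumptions, and it matches on exact column names only, ignoring lineage.

```python
def propagate_descriptions(columns):
    """Copy an existing description to undocumented columns with the same name.

    `columns` is a list of dicts with the keys "table", "name", and
    "description" (None when undocumented). The list is updated in place,
    and the (table, name) pairs that were filled in are returned.
    """
    # Remember the first description seen for each column name.
    known = {}
    for col in columns:
        if col["description"]:
            known.setdefault(col["name"], col["description"])

    filled = []
    for col in columns:
        if not col["description"] and col["name"] in known:
            col["description"] = known[col["name"]]
            filled.append((col["table"], col["name"]))
    return filled


# Hypothetical catalog entries.
catalog = [
    {"table": "orders", "name": "customer_id",
     "description": "Unique ID of the customer"},
    {"table": "invoices", "name": "customer_id", "description": None},
    {"table": "invoices", "name": "total", "description": None},
]
filled = propagate_descriptions(catalog)
```

Columns that already have a description are never overwritten, and columns with no documented namesake (such as `total` here) are left empty for a human to describe.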
What does the future of data documentation tooling look like?
Companies are becoming more reliant on data to make decisions in all departments, and entire teams, like business intelligence teams, are dedicated to using data insights and analysis to guide these decisions. This increased focus on data means that data teams need to build processes and systems that support scaling. To do so, documentation and data knowledge need to be standardized and, better yet, automated.
In the future, it’s likely that more and more data documentation tools will adopt an automated approach similar to Secoda, or that data teams will look into building their own automated systems. However, few if any tools currently make documentation as easy as Secoda does.
If you’re looking for new ways to make data documentation seamless and low-lift for your team, consider booking a demo to see how we can do that for you.