Staying ahead of data debt
April 19, 2021
TL;DR: Data debt is a species of technical debt that is created when teams don’t catalogue, clean and categorize their data. It drags down productivity and costs the organization in compute. The best way to stay ahead of data debt is to adopt a proactive data governance tool that tells you which important data is stale, muddled or undocumented, and when it became so. Secoda is a tool that makes it easy to stay ahead of data debt. We welcome you to try it out and join our Slack!
Problem: data tables that were created at some point but are no longer used by anyone in the organization, either in direct queries or in the organization’s BI tool
We created a table a while back and it might be useful but it’s tough to tell. Maybe this table was created by someone that left the organization and now it’s continuing to be maintained for no reason. Although storing data has become cheap, it is not free. There is no benefit to updating and storing tables that no one uses. Since creating the tables, you’ve added two teams, six data analysts/engineers and tons of tables in your data warehouse. You know you need to conduct a spring cleaning, but the job is so daunting you don't know where to start.
Over the time that we have been in the data space, we have noticed there are four distinct phases to the data maturity of an organization.
- In the first phase, a team doesn’t collect data at all. This phase is all about trying to convince customers to pay for your product.
- In the second phase, an engineer starts using product analytics to track simple product data through a no-code BI tool like Mixpanel.
- In the third phase, teams start to think about using their data properly and bring on a data team to organize their warehouse and standardize their dashboards.
- In the fourth phase, teams start to scale and realize that a lot of the practices that worked before are undocumented and poorly managed.
We believe that teams should consider data management and governance tools a part of the core data infrastructure. Doing this can help teams save costs both directly and indirectly. Additionally, teams can avoid a looming problem that is difficult to solve once it’s too late: data debt.
What is data debt?
There’s a shift that happens in most companies that want to become data-driven. This shift requires companies to spend more money hiring data engineers, data scientists and analysts. Once the data team is hired, they work as fast as possible to make sure that the business team is using the right information to make decisions. This pace and pressure create data debt.
This is true of larger organizations as well. Without the proper governance and data management tools, teams struggle to tackle their growing data debt.
We learned that most teams put this process off today not because they don’t want to do it, but because they feel like the work done to document and manage data is never complete and always outdated.
The problem with putting data governance off is that it creates inefficiencies that compound over time, making it more difficult to tackle in the future. When teams don’t have proper data governance, onboarding new data analysts becomes difficult and costly. Teams lose sleep over the data sprawl and the overwhelming thought of cleaning it.
This is what we call data debt. It’s when you have undocumented, unused, incomplete and inconsistent data.
Data debt becomes a problem for teams much earlier than most teams realize. When teams adopt a self-service model of delivering insights to business users, and their data debt is not solved, data teams could risk spending time managing reports no one uses and producing data that no one understands.
The 4 D’s of data debt
Dark data is data that hasn’t been catalogued or categorized. Most teams start to tackle data debt by writing down their documentation in Confluence. We wrote this article to help teams identify when they should reconsider their documentation tools. The problem with Confluence documentation is that as soon as the data has been documented, it is out of date. Having your documentation in a central place is not enough. That central source of truth should also notify you if documentation or schema changes and what commonly used resources are missing documentation.
Duplicate data is usually a partial copy of a primary data source. Duplicates make it difficult to keep track of the source of truth and should be removed if they are not impacting any tables or dashboards.
Poor-quality data is data that scores badly on data quality, a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and whether it’s up to date. Not knowing these important metrics, or having poor data quality, contributes to data debt.
Decayed data is any data asset sitting in your warehouse or BI tools that is not getting used, either directly or indirectly. Instead of removing these assets, many teams spend time managing reports no one uses and producing data that just sits in the warehouse.
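To make the four categories concrete, here is a minimal sketch of how a team might label tables against them. Everything in it is an assumption made for illustration: the `TableInfo` fields, the use of a null rate as a stand-in for data quality, and the 90-day and 20% thresholds are hypothetical, not Secoda’s actual rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class TableInfo:
    # All fields are hypothetical metadata a catalogue might hold.
    name: str
    documented: bool             # has a description in the catalogue
    copy_of: Optional[str]       # primary source, if this is a partial copy
    null_rate: float             # fraction of null values (a crude quality signal)
    last_used: Optional[date]    # last direct or indirect use, if any


def classify(table: TableInfo, today: date,
             stale_after_days: int = 90,
             max_null_rate: float = 0.2) -> List[str]:
    """Return which of the four categories apply to a table."""
    labels = []
    if not table.documented:
        labels.append("dark")
    if table.copy_of is not None:
        labels.append("duplicate")
    if table.null_rate > max_null_rate:
        labels.append("poor quality")
    never_used = table.last_used is None
    if never_used or today - table.last_used > timedelta(days=stale_after_days):
        labels.append("decayed")
    return labels
```

A table can carry several labels at once, which is usually the case for the assets that contribute most to data debt.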
Why should teams care?
This problem costs organizations directly and indirectly. The direct costs are related to data storage and compute. Although storing data has become cheap, it is not free. We strongly believe that there is no benefit to updating and storing tables that no one uses. Data storage costs have decreased and will continue to drop, but compute is still a valuable and costly resource. Running jobs to update tables that are not used consumes compute resources and returns no benefit.
Additionally, collecting unused, undocumented, muddled data makes it much more difficult for employees to find the information that they need. If naming conventions and documentation are poor and dashboards are ungoverned, how can an employee know which dashboard is trustworthy and correct? Within the warehouse, how can a data analyst on another team know which table to build reports on?
Today, most teams don’t evaluate their data debt. Instead, they continue to collect data and dashboards, regardless of their value to the organization. Some organizations conduct ad hoc audits to wipe old dashboards and tables, but this is difficult to do consistently and across the entire organization. For most companies, decreasing data debt will decrease technology costs and significantly increase productivity. At Secoda, we’re trying to help data teams tackle this messy problem.
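As a rough illustration of the direct cost, the sketch below adds up the monthly compute and storage spent on tables that nobody uses. The field names and the prices in the example are hypothetical placeholders; substitute the figures from your own warehouse bill.

```python
def monthly_waste(unused_tables, storage_price_per_gb):
    """Estimate monthly spend on unused tables (compute plus storage)."""
    total = 0.0
    for table in unused_tables:
        # Compute: every scheduled refresh of an unused table is wasted work.
        compute = table["refreshes_per_month"] * table["cost_per_refresh"]
        # Storage: cheap, but not free.
        storage = table["size_gb"] * storage_price_per_gb
        total += compute + storage
    return total


# Hypothetical example: one unused table refreshed daily.
example = [
    {"name": "old_export", "refreshes_per_month": 30,
     "cost_per_refresh": 0.5, "size_gb": 100},
]
waste = monthly_waste(example, storage_price_per_gb=0.02)
```

In this made-up example the compute term dominates the storage term, which matches the point that compute, not storage, is where unused tables tend to cost the most.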
Why is this a difficult problem to solve?
It is easy to query a data warehouse and get a list of tables that aren’t queried regularly. However, query activity alone doesn’t tell you whether a table is being used. A table could be updated daily but never be pulled into the BI tool. Or it could be updated daily and be used directly by a downstream task whose output is pulled into the BI tool.
To understand the full picture of what data is used and what isn’t, we have to pull from and understand: data warehouse query logs, GitHub code/data lineage, and BI tool data sources and dashboards. It is hard for organizations to prioritize decreasing debt when doing it right isn’t easy.
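One way to combine these three sources is to treat lineage as a graph and walk upstream from everything that is demonstrably consumed. The sketch below is a simplified illustration, not Secoda’s implementation: it assumes you have already extracted a set of BI data sources, a set of tables queried directly by people, and a lineage mapping from each table to the tables it is built from. All table names are hypothetical.

```python
from collections import deque


def find_used_tables(consumed, lineage):
    """Return every table that is consumed or feeds a consumed table.

    `lineage` maps a table to the list of upstream tables it is built from.
    """
    used = set(consumed)
    queue = deque(used)
    while queue:
        table = queue.popleft()
        for upstream in lineage.get(table, []):
            if upstream not in used:
                used.add(upstream)
                queue.append(upstream)
    return used


def find_unused_tables(all_tables, bi_sources, directly_queried, lineage):
    """Tables that are neither consumed nor upstream of anything consumed."""
    consumed = set(bi_sources) | set(directly_queried)
    return set(all_tables) - find_used_tables(consumed, lineage)


# Hypothetical warehouse: raw_events is never queried by a person, but it is
# used indirectly because it feeds a dashboard source through stg_events.
all_tables = {"raw_events", "stg_events", "daily_active_users", "old_export"}
lineage = {
    "stg_events": ["raw_events"],
    "daily_active_users": ["stg_events"],
    "old_export": ["raw_events"],
}
unused = find_unused_tables(
    all_tables,
    bi_sources={"daily_active_users"},
    directly_queried=set(),
    lineage=lineage,
)
```

Here `old_export` is the only table flagged, even though it is refreshed from `raw_events` like everything else. A query-log check alone would have missed the distinction.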
Additionally, this is difficult because the target is constantly moving. That is why we believe that a system to solve data debt should not only automatically catalogue data, but also proactively recommend the highest priority data organization tasks on a weekly or monthly basis. This can help teams tackle the most important issues related to their data organization piece by piece, instead of trying to do it all at once.
Lastly, data debt is also difficult to quantify. Data debt is not something that teams explicitly measure. We believe that in the future there will be a standard metric that can quantify its business impact.
What can data teams do to stay ahead of data debt?
Teams that want to address data debt should start with an understanding of the complexity of the problem and the commitment to an ongoing solution. Bandaid fixes are only going to solve the problem temporarily. Instead, teams should adopt definitions that are understood and accessible to all members of the organization. Then teams should consider what the right type of data governance structure could look like for them. Some solutions could be standardizing data visualizations and removing unused reports, defining data dictionaries, adopting a data catalogue that alerts teams when things need documentation, and instituting data quality procedures.
What can Secoda do to help teams with data debt?
We wouldn’t raise this problem if we didn’t have a solution. We’re working on a framework to help teams stay ahead of data debt. We call this tool Secoda. The core product is a data catalogue tool that centralizes data from across the entire data stack. The tool is designed to give data teams recommendations about which tables and dashboards are contributing most to the company’s data debt.
We believe that the problem with most solutions is that they are passive tools. They don’t let you know when things change or recommend what needs updating. Secoda helps teams keep data documented and organized by giving data teams proactive updates every week. When connected to the entire data stack, Secoda is a powerful tool for data management.
Secoda is a very young tool, but it’s already used by over 100 teams across 20 countries. We’ve worked with our early adopters to build better tools in our community. We would love to welcome you and your team to our community and feedback cycle. Below is the workflow that Secoda streamlines for data teams:
1. Inspect your current state
Integrate your data into Secoda to see the state of data debt across the organization.
2. Analyze your data debt
Secoda will summarize your current state of tracking and highlight current issues, such as which tables and dashboards are undocumented, unused or muddled.
3. Prioritize your data debt
Secoda recommends which data is a top priority to fix and gives you tasks that can make the biggest impact. Data teams get a weekly report with the most important changes and how they will impact the company’s data debt.
4. Fix important issues
Secoda gives teams a simple dashboard to manage their data. Add documentation, remove tables and collaborate with other employees to stay ahead of data debt. We created a calculator to help teams measure the cost of their data debt and the ROI of a data discovery tool, which you can find here.
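For readers who want a feel for the prioritization step in the workflow above, here is a minimal sketch of one possible ranking. It is an assumption-laden illustration and not how Secoda scores data: the dictionary fields are hypothetical, undocumented tables are ranked by how often they are queried, and unused tables are ranked by what they cost to maintain.

```python
def prioritize(tables, top_n=3):
    """Build two short task lists from hypothetical table metadata."""
    # Undocumented tables that people rely on: most-queried first.
    to_document = sorted(
        (t for t in tables if t["monthly_queries"] > 0 and not t["documented"]),
        key=lambda t: t["monthly_queries"],
        reverse=True,
    )
    # Tables nobody queries: most expensive first.
    to_review = sorted(
        (t for t in tables if t["monthly_queries"] == 0),
        key=lambda t: t["monthly_cost"],
        reverse=True,
    )
    return {
        "document": [t["name"] for t in to_document[:top_n]],
        "review_for_removal": [t["name"] for t in to_review[:top_n]],
    }


# Hypothetical metadata for four tables.
tables = [
    {"name": "orders", "monthly_queries": 900, "documented": False, "monthly_cost": 40.0},
    {"name": "users", "monthly_queries": 300, "documented": False, "monthly_cost": 10.0},
    {"name": "tmp_2019", "monthly_queries": 0, "documented": False, "monthly_cost": 25.0},
    {"name": "revenue", "monthly_queries": 500, "documented": True, "monthly_cost": 30.0},
]
report = prioritize(tables)
```

Keeping the two lists separate avoids mixing query counts and costs into a single score, which would require weights that each team would have to choose for itself.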
We’re confident that we’re close to building something special for data governance and we’re excited to share Secoda with everyone looking to benefit from data management at an early stage. So please ask questions and make suggestions by email or on Twitter. We want to know the good, the bad, and the ugly. We can’t wait to hear from you!