Solving data discovery starts by getting on top of your team’s data debt. Data debt is a type of technical debt that is created when teams don’t catalogue, clean and organize their data. It drags down productivity and drives up the organization’s compute costs.
The best way to stay ahead of data debt is to adopt a proactive data management tool, which lets you know when important data becomes stale, muddled or undocumented. When there is a large amount of data debt, the right resources are difficult to find, manage and understand.
This costs teams time and effort, but too often data debt is difficult to measure. While the problem may seem like a minor nuisance in the short term, its direct and indirect costs add up.
In this article, we will introduce a new way to measure the financial impact of data debt and poor data discovery on your business.
When we speak with data engineers and analysts, most voice concern about dark, muddled, decayed and undocumented data, but most don’t know where to start or how to measure the cost of data debt.
The problem costs organizations directly, through higher compute spend, and indirectly, as more people rely on the data team whenever they need answers.
Although the direct cost of storage has decreased and will likely continue to decrease, storage is still a priced resource that teams should try to optimize. Running jobs to update tables that nobody uses consumes compute resources with no benefit. Additionally, collecting unused, undocumented, muddled data makes it much more difficult to find the right information.
Today, most teams don’t evaluate their data debt. Instead, they continue to collect data and dashboards, regardless of their value to the organization. Decreasing data debt will decrease technology costs and significantly increase productivity.
The problem is that organizing data to make it more discoverable, and so lowering data debt, is a costly task that can feel like a never-ending challenge. This makes it difficult to justify the cost of a data discovery project.
One customer we interviewed said “It’s obvious to us that discoverable data is important to the business, the problem is that the second we finish organizing and documenting data, it’s already outdated. Measuring ROI and justifying the cost of this type of project is difficult when the result is unclear.”
Instead of working with the rest of the team to keep the company data clean and documented, the customer chooses to use Confluence documents and hack together solutions that may solve the problem only temporarily. More often than not, data teams are seen as cost centres. Budget and resources are allocated to more exciting projects, which forces data teams to lean on each other as they try to make the data more accessible.
This anecdote is a common one, shared across many data teams. Justifying and measuring the ROI of data discovery is difficult.
Because of this, we’ve been considering how data teams should measure the cost of data debt across their organization. We’ve found that the following metrics can help data teams justify the cost of the project to management:
- Discovery time
- Organization time
- Business user productivity
Data discovery time is the amount of time that it takes for your data engineering and analytics team to find the right data, understand what it means and use it to answer a data request. We think about this metric similarly to the way customer support teams think about resolution time: the time that it takes to go from ticket request to answered request.
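As a rough illustration of treating discovery time like a support team's resolution time, the sketch below computes the median hours between a data request being opened and being answered. The ticket log, its timestamps and the function names are hypothetical and only stand in for whatever request tracker your team uses.

```python
from datetime import datetime
from statistics import median

# Hypothetical request log: (requested_at, answered_at) per data request.
tickets = [
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 15, 0)),
    (datetime(2023, 3, 2, 10, 0), datetime(2023, 3, 4, 10, 0)),
    (datetime(2023, 3, 3, 8, 30), datetime(2023, 3, 3, 20, 30)),
]


def resolution_hours(requested_at: datetime, answered_at: datetime) -> float:
    """Hours elapsed between a request being opened and being answered."""
    return (answered_at - requested_at).total_seconds() / 3600


def median_resolution_hours(log) -> float:
    """Median resolution time in hours across all requests in the log."""
    return median(resolution_hours(req, ans) for req, ans in log)


print(f"Median resolution time: {median_resolution_hours(tickets):.1f} hours")
```

The median is used here rather than the mean so that one request that drags on for weeks does not dominate the figure; that is a design choice, not something the metric requires.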
Self-service analytics tools like Looker and Tableau have helped reduce resolution time for data requests. This investment has helped data teams focus on the complicated queries and analysis. The problem is that BI tools don’t solve this problem alone. Data teams still spend about 30% of their time trying to find and understand which data they can trust and how to use it. The data discovery workflow is the biggest bottleneck for data teams who want to enable self-service for product, marketing and customer support teams. Without this information, BI tools are limited in their ability to service all requests. Without a good data discovery tool, resolution time can increase and take up a majority of an analyst’s time and effort.
For many data teams, data resolution time is measured in days or weeks (if it is measured at all). Data teams know that this inefficiency exists but feel there is little they can do to solve it. Still, most data teams know that the time spent on data discovery is incredibly costly to the business for two primary reasons:
- Business decisions that take longer to make become costly as key decision makers wait on the right information.
- The more time that passes between a request and an insight, the more likely it is that the requester will have moved on to the next task, making the insight less relevant and the value of the data team less clear.
But before jumping into tools that solve this problem, it’s important to understand the cost of data organization and data discovery. Traditionally, teams spend about 25% of their time finding data. Teams should also have an idea of how much time they spend cleaning and documenting data in a given week.
Next, the data team should consider measuring the amount of time spent organizing data. This is the amount of time spent cleaning, documenting and organizing data to make it legible for other employees. More hours spent organizing the data can lead to faster data resolution time, but this is not always the case. Many teams find that organizing the data is a never-ending battle. As more data is collected, it becomes harder to organize older data sets. Below are a few important strategies teams should consider when trying to reduce the amount of time spent organizing the data:
- Onboarding a data catalogue that automatically updates metadata based on the data. This is a much more efficient way to organize the data than Confluence.
- Using a proactive data catalogue tool that reminds employees to document, clean and remove stale data. Ideally, this proactive layer is integrated into PagerDuty, Slack, Jira or any other workflow management solutions you use.
- Having a tool that is collaborative and gives multiple stakeholders the ability to document and manage data in a central place.
By staying on top of outdated, undocumented and stale data, teams can find the right information much faster. Adopting a tool that makes managing the data easier is one important step that teams usually overlook.
Business user productivity
Many business users are interested in data but don't know whether it exists or whom to approach. Having a data discovery tool might boost their productivity. Most tools that store data are built for the technical team and therefore make it difficult for business users to access and use data to make decisions. Improving the performance of business users through better data discovery can help all employees on the team make decisions faster. This also lifts a burden from data teams, which can give them time to focus on more important tasks.
Measuring data debt
You can measure the financial impact of data debt by looking at how much money it costs your team to discover and organize the data.
The hourly cost is the average cost of one hour of a data team member’s time spent on discovery or data organization. Assuming an average salary of $75k annually, a team of 10 data engineers and analysts, and 15 hours per analyst per week spent on data discovery and organization, the cost of data debt to the business could approach $300,000/year (15 hours is 37.5% of a 40-hour week, and 37.5% of $750k in salaries is about $281k).
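The arithmetic above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive costing model: the 40-hour week and 52 paid weeks are assumptions the article does not state, the function names are made up for illustration, and salary is used as the only cost (benefits and overhead are ignored).

```python
HOURS_PER_WEEK = 40  # assumption: standard full-time week
WEEKS_PER_YEAR = 52  # assumption: salary is spread over 52 paid weeks


def hourly_cost(annual_salary: float) -> float:
    """Average cost of one hour of a team member's time."""
    return annual_salary / (HOURS_PER_WEEK * WEEKS_PER_YEAR)


def annual_data_debt_cost(
    annual_salary: float, team_size: int, debt_hours_per_week: float
) -> float:
    """Yearly cost of time spent discovering and organizing data."""
    hours_per_year = debt_hours_per_week * WEEKS_PER_YEAR * team_size
    return hourly_cost(annual_salary) * hours_per_year


# Figures from the example above: $75k salary, 10 people, 15 hours/week each.
cost = annual_data_debt_cost(75_000, team_size=10, debt_hours_per_week=15)
print(f"Estimated annual cost of data debt: ${cost:,.0f}")
```

With these assumptions the estimate comes to about $281,250 a year, in the same range as the roughly $300,000 cited; counting benefits and overhead on top of salary would push it higher.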
The impact of data debt varies significantly with the amount of data collected, the number of employees who access data on a regular basis and the state of data documentation and management in the organization. Additionally, companies might notice that the cost of data debt varies depending on business activities. For example, if a business is hiring or going through a business model change, it might find the cost of data debt more painful than when it is in a steady state.
It is also important to remember that this cost estimate does not factor in the opportunity cost of organizing or discovering data, or the cost of using the wrong data. Having analysts who don’t have to spend as much time finding the right information and engineers who can manage data easily can help teams take on more projects that demonstrate the value of the data function. Additionally, having data that employees can trust is a massive benefit to an organization’s productivity.
By having an idea of the cost of data debt, teams can more easily calculate their return on investment for a data discovery tool. Without the existing baseline, it’s much more difficult to get buy-in from the managers controlling budget for your team.
Just imagine a world where you could tell your manager that adding another data analyst and purchasing a data discovery tool would help your team reduce the average monthly time spent organizing and discovering data by 200 hours, which translates to approximately $300K in yearly savings and an ROI of 2.79x a year.
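To make a pitch like this reproducible, the ROI multiple can be computed from the baseline. The sketch below assumes the multiple is simply yearly savings divided by yearly investment; the article does not say which ROI definition it uses, and the function names are hypothetical.

```python
def roi_multiple(annual_savings: float, annual_investment: float) -> float:
    """Savings returned per dollar invested (2.0 means a 2x return).

    Assumption: ROI is read as savings / investment, not net gain / investment.
    """
    if annual_investment <= 0:
        raise ValueError("annual_investment must be positive")
    return annual_savings / annual_investment


def implied_investment(annual_savings: float, multiple: float) -> float:
    """Yearly investment implied by a given savings figure and ROI multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return annual_savings / multiple


# Figures quoted above: about $300K in yearly savings at a 2.79x multiple.
investment = implied_investment(300_000, 2.79)
print(f"Implied yearly investment: ${investment:,.0f}")
print(f"ROI multiple: {roi_multiple(300_000, investment):.2f}x")
```

Under this reading, the quoted figures would correspond to a yearly investment of roughly $107,500 for the extra analyst and the tool combined. That number is derived here, not stated in the article.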