How to harness metadata for infrastructure health

Metadata is the future of effective data management across a modern data stack and is a necessary asset for scaling your data infrastructure.
Last updated: October 17, 2023

This article was co-authored by Lindsay Murphy, Head of Data at Secoda, and Timo Dechau, Founder of Deepskydata.

We’re in the midst of an economic shift in the tech industry, which has put increased pressure on data teams to demonstrate impact and ROI while reducing costs. With many tools in the modern data stack already using (or shifting to) consumption-based pricing models, it is very easy for data teams to accumulate substantial operating costs. Unfortunately, measuring the tangible revenue and business impact of their efforts remains elusive, causing many data teams to be perceived as cost centers rather than value creators.

By design, data teams often dedicate the majority of their time to making data available to other teams and functions, allowing for performance measurement and optimization. However, there's a critical irony here—data teams often neglect to measure and optimize their own performance in a data-driven way. This gap can be attributed to the manual and challenging process of collecting and measuring metadata from tools scattered across a modern data stack. Identifying gaps and inefficiencies across the entire system becomes a non-trivial task for already busy data teams.

The metadata silo problem

Metadata, when it is available, is often scattered across the various tools we use in the modern data stack, which poses a significant hurdle for data teams interested in using it to monitor and optimize team operations and infrastructure health. While the modern data stack (MDS) offers a plethora of tools for data generation, transformation, and consumption, few tools focus on helping data teams monitor and measure their entire data infrastructure comprehensively. Most existing solutions (for example, Elementary for dbt projects, or SELECT for Snowflake cost containment) only provide visibility into a single tool within the stack, creating metadata silos and limiting our ability to see how issues are connected and address them appropriately.

Moreover, forecasting the long-term performance and cost of running a modern data stack is challenging, with latent issues often remaining somewhat hidden until they escalate into major problems, such as slow-running dbt models, poor query performance, and ballooning warehouse compute costs.

Of course, this isn't a new problem. At its core, nothing here is new; it is a familiar problem in a new context. Database administrators were, and still are, true masters of system tables and metadata. In DevOps, metadata is essential for any orchestration.

With the rise of the modern data stack, we embraced decoupling, which gives us more accessibility and more freedom to assemble a setup that does the job we need done. But decoupling also made metadata significantly more complex, since it now lives in many separate places.

Metadata shines in an integrated environment, such as a single data platform.

So this is the new context we have today: how can we bring together the decoupled metadata from all these different places to achieve an experience similar to the one we would have in an integrated data platform?

Standardizing and centralizing metadata

To address these challenges and reach a higher level of maturity and scalability, data teams should consider investing more attention in building high-quality, centralized metadata assets. Metadata can provide critical insights into the health of a data team's infrastructure and operations.

Here are some key areas where metadata can make a significant impact:

  1. End-to-End Pipeline Performance: Comprehensive metadata that spans an entire pipeline from source to serving would include information such as source rows loaded, cost of keeping a source fresh, transformation model run time, model run cost, changes in up and downstream dependencies, and more. With this information, data teams can effectively monitor and optimize the operation of specific pipelines.
  2. Pipeline Lineage and Usage: Expanding from the previous point, overlaying usage (queries, runs, etc.) with pipeline lineage would help data teams identify their most business-critical pipelines, and ensure those are running as efficiently as possible, and also identify unused pipelines for deprecation. Understanding how internal BI tools and data are used is crucial. Metrics like queries, cost per query, dashboard views, and active users of BI tools can provide insights into user behaviour and help optimize resources. Aligning a goals framework (such as OKRs) with this information can also help data teams ensure company goals and OKRs are being properly measured with data assets (read our article on Integrating a Goals Framework into your Data Stack for more on this topic!)
  3. Data Team Delivery and Efficiency Metrics: Borrowing from the practice of DevOps, data teams could look at measuring DevOps Research and Assessment (DORA) metrics to measure and improve the team's efficiency and delivery. Metrics like lead time for changes, deployment frequency, change failure rate, and time to restore service can offer valuable insights.
  4. Overall Infrastructure Health: Metrics such as daily cost, number of pipeline runs, number of model runs, cost per run, test success rate, and more could serve as essential metrics to gauge the overall health of the data stack and infrastructure. Applying forecasting and predictive analysis to this metadata can help anticipate potential issues and take proactive measures. This can be a game-changer in ensuring the scalability of a data function.
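
As a rough illustration of the infrastructure health metrics in point 4, the following Python sketch rolls individual run records up into daily cost, cost per run, and test success rate. The RunRecord schema and the sample values are hypothetical; in practice these fields would come from whatever your orchestrator, transformation tool, and warehouse expose.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RunRecord:
    """One pipeline or model run (hypothetical schema for illustration)."""
    day: date
    cost_usd: float
    tests_passed: int
    tests_total: int


def daily_health(runs):
    """Roll run records up into per-day health metrics."""
    by_day = defaultdict(list)
    for run in runs:
        by_day[run.day].append(run)

    summary = {}
    for day, day_runs in sorted(by_day.items()):
        total_cost = sum(r.cost_usd for r in day_runs)
        tests_total = sum(r.tests_total for r in day_runs)
        tests_passed = sum(r.tests_passed for r in day_runs)
        summary[day] = {
            "runs": len(day_runs),
            "daily_cost_usd": total_cost,
            "cost_per_run_usd": total_cost / len(day_runs),
            # None when no tests ran that day, to avoid dividing by zero
            "test_success_rate": tests_passed / tests_total if tests_total else None,
        }
    return summary


# Made-up sample data
runs = [
    RunRecord(date(2023, 10, 1), 2.0, 9, 10),
    RunRecord(date(2023, 10, 1), 4.0, 10, 10),
    RunRecord(date(2023, 10, 2), 3.0, 0, 0),
]
health = daily_health(runs)
```

Storing one such summary per day gives a simple time series that forecasting or alerting could later be applied to.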

The future of effective data management is rooted in metadata analysis

Metadata is the future of effective data management across a modern data stack. Data teams need tools that provide critical metadata easily, and ideally, in one centralized place. By harnessing the power of metadata and leveraging it for performance optimization, we can overcome the challenges that have long plagued data teams. With better visibility, proactive alerts, and predictive analysis, we can not only measure and optimize our own performance but also pave the way for the next chapter of data team scaling and maturity.

How to analyze and work with metadata

As described earlier, the benefits of metadata are significant, and it is a necessary asset for scaling your data infrastructure. But data teams are still stuck with the silo problem: metadata is spread across different tools, and the ease of accessing it varies greatly.

So what does an approach to building a “metadata data stack” look like?

Let’s first talk about the types of metadata available in the data space: state and process metadata.

State metadata

This kind of metadata is classic snapshot data, and may be what you think about when you hear “data observability”. State metadata describes an object at a specific point in time. An object can be a table, a pipeline process, a column, or a query.

For example, state metadata for a table could include the number of rows, number of columns, ownership, creation date, last update date, location, and storage in GB. State metadata has a limited use case. It is helpful for specific kinds of metrics, like how much storage we use (which has an impact on the costs). If this metadata is saved as a snapshot it can enable over-time comparisons of sizes (table rows, runtimes).

When we talk about database state metadata, this is usually easy to get and often a part of system tables that can be queried.
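
To make the over-time comparison concrete, here is a minimal Python sketch that compares two saved snapshots of the same table. The TableSnapshot fields mirror the examples above (row count, storage in GB), but the schema, table name, and numbers are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TableSnapshot:
    """State metadata for one table on one day (hypothetical schema)."""
    table: str
    taken_on: date
    row_count: int
    storage_gb: float


def growth_between(older, newer):
    """Compare two snapshots of the same table."""
    if older.table != newer.table:
        raise ValueError("snapshots must describe the same table")
    days = (newer.taken_on - older.taken_on).days
    if days <= 0:
        raise ValueError("newer snapshot must be taken after the older one")
    rows_added = newer.row_count - older.row_count
    return {
        "rows_added": rows_added,
        "storage_gb_added": newer.storage_gb - older.storage_gb,
        "rows_added_per_day": rows_added / days,
    }


# Made-up sample snapshots, 30 days apart
before = TableSnapshot("orders", date(2023, 9, 1), 1_000_000, 12.0)
after = TableSnapshot("orders", date(2023, 10, 1), 1_300_000, 15.0)
growth = growth_between(before, after)
```

A single snapshot only tells you the current size; it is the stored history that makes growth rates like these available.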

Process Metadata

In our opinion, this is the more interesting type of metadata. If data teams only ever needed to load data once, never changed their pipelines, and just queried the data from time to time, then state metadata would be sufficient.

Unfortunately, our profession loves to extract, transform and load data, often with plenty of steps (and often copies of data) in between. What could possibly go wrong?

This is where we can find insights from process metadata that can help data teams manage their pipelines in a more informed way. So how do we get there?

In most data stack setups, we analyze how process steps develop over time, how they interact with each other, and what kind of context data they produce. In the analytics space, we call these events. Their raw shape, combined with rich context, enables plenty of analysis use cases, such as analyzing a pipeline as a funnel. And if we can find common identifiers, we can even match these events across different data sources.

One way to work with process metadata is to treat it as event data. Every time something is happening across a data pipeline, it can trigger an event. Many data systems already have this data, either readily available for analysis or stored in logs.

A quick example: Snowflake and Google BigQuery have INFORMATION_SCHEMA views that provide metadata about your data warehouse. One type of view is the job view, which contains each job that was run within the warehouse. These are usually queries.

Here we get a nice range of useful data that we can build into an event like

Query started -> start_time, cache_hit, destination_table, ...

We could even get the query stages if we wanted to dig really deep.

These events can then be collected in a metadata event table. You can analyze them with any event analytics tool, such as Amplitude, Kubit, Netspring or Mixpanel, or simply use them to create metrics for internal monitoring and optimization of data operations.
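
One possible way to shape job rows into events is sketched below in Python. The input dictionaries stand in for rows from a warehouse job view. The fields start_time, cache_hit and destination_table come from the example above, while job_id, end_time, the event names, and the sample values are assumptions made for illustration.

```python
from datetime import datetime


def job_to_events(job):
    """Turn one job row into 'query started' / 'query finished' events."""
    context = {
        "cache_hit": job["cache_hit"],
        "destination_table": job["destination_table"],
    }
    events = [{
        "event": "query_started",
        "entity_id": job["job_id"],
        "timestamp": job["start_time"],
        **context,
    }]
    # Jobs that are still running have no end time yet
    if job.get("end_time") is not None:
        events.append({
            "event": "query_finished",
            "entity_id": job["job_id"],
            "timestamp": job["end_time"],
            "duration_s": (job["end_time"] - job["start_time"]).total_seconds(),
            **context,
        })
    return events


# Made-up rows standing in for a warehouse job view
jobs = [
    {"job_id": "job_1", "start_time": datetime(2023, 10, 1, 6, 0, 0),
     "end_time": datetime(2023, 10, 1, 6, 0, 42),
     "cache_hit": False, "destination_table": "analytics.orders"},
    {"job_id": "job_2", "start_time": datetime(2023, 10, 1, 6, 5, 0),
     "end_time": None, "cache_hit": True, "destination_table": None},
]

# A flat, time-ordered event table
event_table = sorted(
    (event for job in jobs for event in job_to_events(job)),
    key=lambda event: event["timestamp"],
)
```

Keeping a shared identifier (here entity_id) on every event is what later allows events from different tools to be matched against each other.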

What kind of metadata can we use to monitor and improve our data operations?

Costs

Moving into 2023, data teams are more focused than ever on containing costs. While in previous years, the cost of setting up a data infrastructure was often seen as an investment, today there is more pressure on data teams to manage or reduce their costs.

The total cost of running a data stack is an output metric that may need to be reported on directly. Output metrics are useful for knowing what happened, but they do not provide insight into how to optimize cost.

Instead, we need to analyze the input metrics for cost. These are things we can actually control and try to improve:

Query time or Model run time - broken down by model or query

This gives us something to work with. When we start to invest in optimizations such as keys and partitioning, we can check their impact by monitoring and analyzing this metric.

Reporting access - broken down by dashboard, source_table and user

Naturally, we want people to access the data through dashboards or direct queries, but our setup might be built for an ideal case. We can use this data to identify unused dashboards, unused source tables, and inactive users.
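
Both cost input metrics can be sketched in a few lines of Python. The example below assumes run times arrive as (model, seconds) pairs and dashboard access as (dashboard, date) pairs. These shapes, the 90-day idle threshold, and the sample values are illustrative assumptions, not a prescribed format.

```python
from collections import defaultdict
from datetime import date, timedelta


def runtime_by_model(runs):
    """Sum run time in seconds per model from (model, seconds) pairs."""
    totals = defaultdict(float)
    for model, seconds in runs:
        totals[model] += seconds
    # Slowest first, so the best optimization candidates come first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def unused_dashboards(all_dashboards, access_log, today, max_idle_days=90):
    """Return dashboards with no access in the last `max_idle_days` days."""
    cutoff = today - timedelta(days=max_idle_days)
    recently_used = {
        dashboard for dashboard, accessed_on in access_log if accessed_on >= cutoff
    }
    return sorted(set(all_dashboards) - recently_used)


# Made-up sample data
model_runs = [("orders", 120.0), ("customers", 30.0), ("orders", 150.0)]
access_log = [("revenue", date(2023, 10, 10)), ("churn", date(2023, 3, 1))]

slowest = runtime_by_model(model_runs)
stale = unused_dashboards(
    ["revenue", "churn", "ops"], access_log, today=date(2023, 10, 17)
)
```

The same set-difference approach could be applied to source tables and users, given an access log that records them.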

Time

Time, of course, correlates with cost to some degree, at least in the parts of our stack that are billed by compute time. But the impact of time is bigger than just cost: specific data needs to arrive at a specific time.

Therefore, total pipeline runtime, or a "core data is available at" metric, is a relevant output metric.

As an input metric, we could look into:

Pipeline run time - broken down by pipeline 
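
As a small sketch of how pipeline run time and an arrival deadline could be checked together, the Python example below flags runs that finished after a daily deadline. The (pipeline, started_at, finished_at) tuple shape, the 07:00 deadline, and the pipeline names are hypothetical.

```python
from datetime import datetime, time


def late_arrivals(runs, deadline):
    """Return runs that finished after the daily deadline, with their run time.

    Assumes each run starts and finishes on the same calendar day.
    """
    late = []
    for pipeline, started_at, finished_at in runs:
        if finished_at.time() > deadline:
            late.append({
                "pipeline": pipeline,
                "finished_at": finished_at,
                "runtime_min": (finished_at - started_at).total_seconds() / 60,
            })
    return late


# Made-up sample runs
runs = [
    ("core_models", datetime(2023, 10, 17, 5, 0), datetime(2023, 10, 17, 6, 30)),
    ("marketing", datetime(2023, 10, 17, 5, 30), datetime(2023, 10, 17, 7, 45)),
]
late = late_arrivals(runs, deadline=time(7, 0))
```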

Quality

Data quality is a broad field and will not boil down to a single output metric. It can span from simple data freshness to successful tests (and these can range from simple uniqueness tests to anomaly detection).

Operations

These kinds of metrics give you insights into how well your data team can evolve the data setup and react to issues. Typical metrics include cycle time (the average time from ticket creation to deployment) or time to recovery (the average time from bug ticket creation to fix deployment).
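
Both metrics are averages over (created, deployed) timestamp pairs, so one helper covers them. The Python sketch below uses made-up ticket data; where the timestamps come from (an issue tracker, a deployment log) is left open and would depend on your tooling.

```python
from datetime import datetime
from statistics import mean


def average_hours(tickets):
    """Average hours between creation and deployment for (created, deployed) pairs."""
    return mean(
        (deployed - created).total_seconds() / 3600 for created, deployed in tickets
    )


# Made-up sample tickets: (created_at, deployed_at)
feature_tickets = [
    (datetime(2023, 10, 2, 9, 0), datetime(2023, 10, 4, 9, 0)),
    (datetime(2023, 10, 3, 9, 0), datetime(2023, 10, 4, 9, 0)),
]
bug_tickets = [
    (datetime(2023, 10, 5, 10, 0), datetime(2023, 10, 5, 12, 0)),
    (datetime(2023, 10, 6, 10, 0), datetime(2023, 10, 6, 16, 0)),
]

cycle_time_h = average_hours(feature_tickets)
time_to_recovery_h = average_hours(bug_tickets)
```

An average hides outliers, so a median or percentile may be a better choice once there are enough tickets to compare.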

The future

In the last 12 months, we have heard more about something in the data space that had gone a bit missing in the years before: design (or data modelling, architecture, ...). In the end, these all ask the same question: how do we design a setup that delivers our current results at the defined quality, and can also scale and extend for future use cases? And how do we know whether our design is still working or has issues? We want to see this as early as possible, so we can adjust without doing a huge refactoring every 12 months.

This is where metadata comes in to save us. With metadata monitoring, we can monitor the health of our data setup and optimize regularly (depending on the size and objectives of the setup: instantly, every week, every month or every quarter). Based on the findings we can add setup optimization tasks to our tasklist and then check the results in our metadata setup.

Metadata analytics enables us to improve specific parts of our data setup in a data-driven way with fast feedback cycles. And as with most newer technologies, this might be just scratching the surface. There are advanced use cases, like automated privacy or security audits and automated self-healing of pipelines, that we expect to arrive in the near future.

As data professionals, we should be focused on championing this cause for metadata-driven optimization, making it a cornerstone of our data strategy. Managing a modern data stack in 2023 can be pretty challenging, but with the right data at our fingertips, we can unlock new levels of efficiency, scalability, and value creation for our teams (and ultimately our organizations).
