Building a data stack in 2023 has never been easier. With the growth and advancement of data tools and capabilities over the last 10 years, even “non-technical” data people can quickly use cloud-based tooling to set up the basic needs of a “modern data stack”. From data collection, ingestion and storage, to transformation and serving, there are countless cloud solutions available to fill the needs of most data teams across the data engineering lifecycle. All of these modern tools aim to abstract away the complexity of building and maintaining data engineering infrastructure, making it easier and faster to get things done.
Unfortunately, this ease of setup and access comes at an (often) high cost. Building a completely cloud-based modern data stack can get very expensive, very quickly, and the problem only gets worse as data teams attempt to scale. So, while setup is easy in 2023, scaling remains an incredibly difficult and complex challenge, and arguably one that is only getting harder due to the fragmentation of the tools modern data teams are using.
In this article, we will dive deeper into the challenges and pitfalls of unchecked infrastructure costs. I will share some advice about the importance of early investment in a data cost containment strategy, and provide some tips to help you get ahead of ballooning infrastructure costs.
How did we get here?
It can be helpful to take a trip down memory lane to appreciate the amount of change that has happened in the data industry over the last 10 to 15 years. The “big data era” and the subsequent excitement about the value of data-driven companies created a boom in tech VC funding–leading to an explosion of available data tooling.
This growth has been great for data people in many ways–we have new roles available to us, we have better tools, and we’re able to work more like software engineers–becoming more efficient and honing our craft. All of this has allowed us to move faster and focus on delivering value (?), rather than reinventing the wheel by building internal data engineering solutions that are not competitive differentiators for our companies.
But, in many ways, this explosion of change has created new challenges for us. We’re now facing the issues that come from moving so quickly with very little governance: spaghetti DAGs, growing complexity and dependencies, overlapping tool capabilities, endless backlogs of new feature requests for our data products, and, perhaps the most critical issue, rapidly growing cost of running infrastructure that often outpaces the value we can deliver.
The problem with ballooning costs
Unless you have been living under a (big) rock, you will have noticed by now that businesses aren’t operating the same way they were ~2.5 years ago. Across all industries, and especially in tech, many companies have had to significantly pivot their business models to focus on sustainable growth and profitability, rather than the “growth at all costs” approach that dominated most of the tech space for the last 10-15 years.
With this shift, businesses have become a lot more serious about measuring the return on investment (ROI) of various business functions, looking for places to cut budgets and reduce spending. Data teams are no exception to this trend–we have been facing unprecedented pressure to demonstrate ROI and overall value. But as we all know, “measuring the value of a data team”…isn’t that easy. Since “having a data team” became a thing, data teams and the businesses that employ them have been trying to answer this question of ROI, and for most, the answer has remained elusive.
Unfortunately (for us), what isn’t as hard to measure is the cost of running a data team. This is causing data teams to be seen as cost centres rather than value drivers–and with the swift shift in the economy, this is a bad place for data teams to be. As companies are forced to tighten their financial belts and look harder at what is driving profitable growth versus unchecked costs, we’re seeing data teams being laid off (a fairly uncommon occurrence before 2021).
Data Teams as Cost Centres
As Tristan Handy, CEO of dbt Labs, called out in this article (which is now almost 2 years old), “We’ve historically had very hand-wave-y conversations about the ROI of data teams. Certainly, we aren’t about to start attributing dollars of profit to individual feats of analytical heroics—that shouldn’t be the goal. But the honest truth is that no organization can do this perfectly.”
Data teams have long faced challenges in demonstrating their ROI, and that was allowed to persist because the economic conditions allowed it, and the hype of data industry growth justified it (or at least distracted us from hard value discussions for a bit longer). However, as mentioned above, the current economic climate demands a more concrete demonstration of ROI across all business functions.
Unpredictability of Costs
To make matters worse, cloud-based data tools and vendors are increasingly shifting toward consumption-based pricing models. This makes it easy for infrastructure costs to spiral out of control and undercut any value you are driving (whether you can measure that value or not). Tools that once offered seat-based pricing are now either replacing it with, or adding, consumption-based pricing, seemingly to grow into their valuations and generate more profit (they face the same economic pressures that all businesses do).
Strategies to balance cost vs. value
So, how can your data team manage this? I believe the biggest issue driving cost is that for most cloud-based tools in the modern data stack, costs and their drivers are not clearly exposed. This leads to siloed data sources, multiple sources of truth, and difficulty generating insights about what is driving costs–the very same issues the modern data stack strives to solve for other business functions.
Make cost containment a top priority
Data teams need to make ongoing cost containment a top priority, not only among their team but also with data producers and consumers. Building internal data assets (like a cost containment monitoring and alerting dashboard) will allow you to leverage data to understand and optimize data stack costs proactively. Gathering costs from sources, such as ETL logs and packages like dbt_snowflake_monitoring, will allow data teams to centralize and inspect more granular data about costs and their drivers.
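As a rough illustration of what centralizing cost data can look like, the sketch below pools cost records from several sources and totals them by tool and by driver. The `CostRecord` shape, its field names, and the sample figures are all hypothetical; they are not the output format of dbt_snowflake_monitoring or of any ETL tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    """One cost line item. Field names are illustrative assumptions."""
    tool: str        # e.g. "warehouse", "ingestion"
    driver: str      # what incurred the cost: a model, connector, or dashboard
    cost_usd: float


def total_by(records, key):
    """Sum cost_usd grouped by the given attribute name ("tool" or "driver")."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


def top_drivers(records, n=3):
    """Return the n most expensive drivers as (driver, cost) pairs, highest first."""
    totals = total_by(records, "driver")
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical sample data standing in for warehouse and ETL cost exports.
records = [
    CostRecord("warehouse", "orders_model", 120.0),
    CostRecord("warehouse", "sessions_model", 340.0),
    CostRecord("ingestion", "crm_connector", 210.0),
    CostRecord("warehouse", "sessions_model", 60.0),
]

print(total_by(records, "tool"))
print(top_drivers(records, n=2))
```

Once every source lands in one shape like this, the same totals can feed a dashboard or an alert without each tool needing its own report.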
Have a cost plan and budget
Building a plan for cost containment early on is critical. Data teams should weigh their options and decide what they should buy, what they can use open-source solutions for, and what the appropriate resource investment in each category should be (e.g. how much do you want to spend on ingestion, compute, transformation, etc.). Consider the costs you will incur by functional area (data generation, ingestion, storage/compute, transformation, orchestration, and serving), as well as by pricing model (variable vs. static costs). Setting a monthly budget and keeping track of whether you have remained within it will ensure you’re staying on top of costs and they’re not slowly increasing, unchecked.
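A monthly budget check along these lines could be sketched as follows. The `BudgetLine` structure, the area names, and the dollar amounts are made up for illustration; the point is only that each line carries both a functional area and a pricing model, so overruns and the variable/static split fall out of the same data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetLine:
    """Monthly budget for one functional area. All values are hypothetical."""
    area: str            # e.g. "ingestion", "storage/compute"
    pricing_model: str   # "variable" or "static"
    budget_usd: float
    spent_usd: float


def over_budget(lines):
    """Return (area, overage) pairs for every area that exceeded its budget."""
    return [
        (line.area, line.spent_usd - line.budget_usd)
        for line in lines
        if line.spent_usd > line.budget_usd
    ]


def spend_by_pricing_model(lines):
    """Total spend split into variable and static costs."""
    totals = {}
    for line in lines:
        totals[line.pricing_model] = totals.get(line.pricing_model, 0.0) + line.spent_usd
    return totals


# Hypothetical month of spend against budget.
month = [
    BudgetLine("ingestion", "variable", 1000.0, 1400.0),
    BudgetLine("storage/compute", "variable", 3000.0, 2750.0),
    BudgetLine("serving", "static", 800.0, 800.0),
]

print(over_budget(month))
print(spend_by_pricing_model(month))
```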
Build safeguards to control variable costs
Wherever possible, data teams need to be setting budgets and controls on variable cost tools. In Snowflake, for example, you can set up resource monitors to automatically suspend a warehouse if it exceeds a certain threshold of credit usage. This matters especially for ELT tools like Fivetran, where monthly active rows (MAR) can jump significantly and unexpectedly based on changes made to source systems by data producers. Fivetran provides no ability to set resource controls or safeguards, so choosing a credit card with a low limit may be the only way to protect yourself while you work on disputing costs.
For credit-based ETL tools (such as Airbyte or Matillion), you can get a bit more creative by purchasing a set limit of credits per month; if you exceed those credits, the tool will suspend loading (another good option while you investigate unexpected spikes in loading usage).
As for serving costs, be sure to limit BI tools’ access to raw data sets and large, wide tables, exposing business users and data consumers only to transformed and ready-to-use data. Also, materializing BI models as tables (rather than layers of views on top of views that take forever to load) will not only improve the end-user experience but will also ensure you’re not running up compute costs from the BI layer.
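To make the Snowflake safeguard concrete, here is a minimal sketch that builds the SQL for a monthly resource monitor and attaches it to a warehouse. The helper function, the monitor and warehouse names, and the quota and threshold values are hypothetical, and the identifiers are interpolated without escaping, so treat it as an illustration and check the statements against the Snowflake documentation before running them.

```python
def resource_monitor_sql(monitor, warehouse, credit_quota, notify_pct=75, suspend_pct=100):
    """Build Snowflake statements for a monthly resource monitor.

    The monitor notifies at notify_pct of the credit quota and suspends the
    warehouse at suspend_pct. Names are not escaped: pass trusted values only.
    """
    if not 0 < notify_pct < suspend_pct:
        raise ValueError("expected 0 < notify_pct < suspend_pct")
    create = (
        f"CREATE OR REPLACE RESOURCE MONITOR {monitor} "
        f"WITH CREDIT_QUOTA = {credit_quota} "
        "FREQUENCY = MONTHLY START_TIMESTAMP = IMMEDIATELY "
        f"TRIGGERS ON {notify_pct} PERCENT DO NOTIFY "
        f"ON {suspend_pct} PERCENT DO SUSPEND;"
    )
    attach = f"ALTER WAREHOUSE {warehouse} SET RESOURCE_MONITOR = {monitor};"
    return [create, attach]


# Hypothetical names and quota, for illustration only.
for statement in resource_monitor_sql("monthly_cap", "analytics_wh", 100):
    print(statement)
```

Generating the statements from code rather than typing them by hand makes it easier to keep one monitor per warehouse under version control and to review quota changes like any other change.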
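Whichever cap you rely on, it helps to notice a loading spike before the cap is hit. The sketch below flags connectors whose latest daily row count is well above their recent average. The connector names, counts, and the 1.5x threshold are assumptions for illustration; real usage data would come from whatever logs or usage exports your loading tool provides.

```python
from statistics import mean


def spiking_connectors(usage, threshold=1.5):
    """Return connectors whose latest count exceeds threshold x their prior average.

    usage maps a connector name to a list of daily row counts, oldest first.
    Connectors with fewer than two data points are skipped.
    """
    flagged = []
    for name, counts in usage.items():
        if len(counts) < 2:
            continue
        *history, latest = counts
        if latest > threshold * mean(history):
            flagged.append(name)
    return flagged


# Hypothetical daily row counts per connector.
usage = {
    "crm": [10_000, 12_000, 11_000, 40_000],   # sudden jump on the last day
    "billing": [5_000, 5_200, 4_900, 5_100],   # steady
    "new_source": [800],                       # not enough history yet
}

print(spiking_connectors(usage))
```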
Negotiate static costs
Static costs are things that don’t change based on usage or consumption (this can be seat-based tools, or annual contracts you are locked into). For tools you know you will need and are non-negotiable, it can make sense to negotiate longer contract terms to recoup some savings (e.g. Redshift allows you to purchase reserved nodes, which can translate into as much as a 76% discounted rate on your annual warehouse costs). Product analytics or BI tools can also be a good place to get savings with longer contract terms.
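Whether a longer commitment pays off depends on how much you actually use the resource. The sketch below uses a deliberately simplified model, which is an assumption of this example rather than any vendor's pricing formula: on-demand cost scales with the fraction of hours used, while a committed plan is paid for every hour at a discounted rate. Under that model the commitment breaks even when utilization equals one minus the discount.

```python
HOURS_PER_YEAR = 8760


def annual_costs(on_demand_hourly, discount, utilization):
    """Return (on_demand_cost, committed_cost) for one year.

    discount and utilization are fractions between 0 and 1. The committed plan
    is assumed to bill every hour of the year at the discounted rate.
    """
    on_demand = on_demand_hourly * HOURS_PER_YEAR * utilization
    committed = on_demand_hourly * (1 - discount) * HOURS_PER_YEAR
    return on_demand, committed


def break_even_utilization(discount):
    """Utilization above which the committed plan is cheaper in this model."""
    return 1 - discount


# Hypothetical rate and discount: at 25% utilization, a 50% discount does not pay off.
on_demand, committed = annual_costs(on_demand_hourly=1.0, discount=0.5, utilization=0.25)
print(on_demand, committed, break_even_utilization(0.5))
```

Real contracts add upfront payment options and term lengths that this ignores, so use it only to frame the conversation before negotiating.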
Educate data team, producers, and consumers on cost containment best practices
Educating everyone on the basics of cost containment, and making it an explicit part of the team’s job description, can help build a culture of responsible cloud spending. If your analytics engineers don’t know how to optimize a query, this is a critical piece of training to prioritize. To drive accountability, there should also be an effort to educate data producers and consumers on what leads to cost spikes (changes in source historical data, schema changes requiring full table refreshes or resets, and BI/self-serve analytics usage patterns).
Build cost containment efforts into day-to-day workflows
This is an area that I believe has huge potential to drive in-the-moment cost optimization–I would love to see data teams building cost-containment insights into their day-to-day work. For example, with the outputs of a package like dbt_snowflake_monitoring, data teams could add changes in a model’s performance and cost to a PR review (“cost-diffing”), allowing themselves and their code reviewers to understand how performance is changing in small, manageable increments (rather than waiting until an overall pipeline slows and causes harder-to-resolve systemic issues).
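A cost-diff for a PR could be as simple as comparing per-model costs before and after the change. The sketch below assumes you already have those two sets of numbers as dictionaries; the model names, costs, and the 20% warning threshold are hypothetical, and nothing here reflects the actual schema of dbt_snowflake_monitoring.

```python
def cost_diff(before, after):
    """Compare per-model costs and return (model, old, new, pct_change) rows.

    pct_change is None when a model was added or removed, or when the old
    cost is zero and a percentage would be undefined.
    """
    rows = []
    for model in sorted(set(before) | set(after)):
        old = before.get(model)
        new = after.get(model)
        if old and new is not None:
            pct = (new - old) / old * 100
        else:
            pct = None
        rows.append((model, old, new, pct))
    return rows


def format_comment(rows, warn_pct=20.0):
    """Render the rows as plain text suitable for a PR comment."""
    lines = []
    for model, old, new, pct in rows:
        if pct is None:
            lines.append(f"{model}: added or removed, no comparison")
        else:
            flag = " (review)" if pct > warn_pct else ""
            lines.append(f"{model}: {old:.2f} -> {new:.2f} USD ({pct:+.1f}%){flag}")
    return "\n".join(lines)


# Hypothetical per-run costs on the main branch vs. the PR branch.
before = {"orders": 2.0, "sessions": 10.0}
after = {"orders": 2.0, "sessions": 15.0, "events": 4.0}

print(format_comment(cost_diff(before, after)))
```

Posting this text as a comment on each PR keeps the cost conversation next to the code that caused the change.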
I also think having a monthly insights discussion about cost drivers and potential optimizations, amongst the data team, but also including critical data producers and consumers, will help drive accountability among those incurring costs. Talking about costs directly and frequently with your data team is not very normalized (in my experience)–I think this is critical to drive accountability and avoid spiraling costs.
Scaling costs responsibly is possible, but requires some groundwork
Running a data team on a budget while scaling is a complex task that requires careful planning and continuous monitoring. Data infrastructure is not a "set-it-and-forget-it" exercise; it demands effective monitoring and alerting with metadata to identify issues early on. Integrating cost containment processes into day-to-day workflows (e.g. adding cost-diffing in PRs, hosting monthly data budget review meetings with the data team, producers, and consumers), and driving accountability for cost containment (e.g. showing costs associated with specific users or departments) can be critical pieces of a cost containment strategy that helps data teams maximize the ROI they deliver to their organizations.
Monitor your data stack with Secoda
Secoda is the only data management platform to give you visibility into pipeline metadata such as cost, query volume, and popularity. Monitor and optimize the health of your pipelines, processes, and data infrastructure as a whole. With Secoda, you can reliably scale your team's compute costs and simplify your stack.