Are we expecting too much?
Welcome to the postmodern era of the Modern Data Stack. We’ve reached the point where we can look back and reflect not only on how far we’ve come as an industry, but also on how far we still have to go. The data industry has seen remarkable advancements over the past 10 years (with even more impressive ones over the last 5). Data teams are able to work faster and build high-quality data products more easily than ever before. However, as with any innovation, we’ve traded old problems for new ones: different challenges have emerged, especially when it comes to scaling and maintaining a data team effectively on a modern data stack.
I’ll start by acknowledging that this topic is not new, or even remotely fresh. A lot of very smart people have shared a lot of really great content about the challenges we face, and how we might see this space evolve over the years.
I’m going to try to take it in a higher-level direction, summarize some of the more conceptual problems, and hopefully offer some thoughts on potential solutions.
First off, are we really being realistic?
Data teams, and their stakeholders, always seem to want more (and more, and more). We want to move fast, build high-quality data products, and do it all at an affordable cost. If you’ve seen the Good, Fast, Cheap Venn diagram before, then you know this is generally not considered possible.
While I agree that most modern data teams:
- have been living in a world of unbridled possibility, and only as of late, have been pushed to show their impact, value, and ROI, and…
- can be doing a lot more to improve their cost containment, data quality, and impact
we can’t expect this to go on indefinitely. Data teams need to balance these challenges against the phase they’re in.
That being said, two things can be true: 1) expectations of modern data teams can be unrealistic, and 2) modern data teams can still be striving to innovate and overcome the common challenges they face.
Let’s dig into it a bit more.
Core challenges of scale in the Modern Data Stack
1. “The Modern Data Stack Killed Data Governance”
I host a monthly meetup here in Toronto called the Toronto Modern Data Stack (check us out if you’re in Toronto), and at a recent event, Teghan Nightengale shared a harrowing tale of an analytics engineer who thought they had done everything right, and yet were still met with a 21-hour run time for their dbt project. The catch? This story was about the internal team at dbt Labs!
The argument here was that, while the modern data stack offers numerous advantages, one downside is that it has diminished traditional data governance. Tools like dbt are great at empowering data teams to move faster and more efficiently, but this often comes at a cost, such as model proliferation, missing or stale documentation, and increased warehouse usage. Without robust data governance, data teams struggle to maintain data integrity and compliance, leading to inefficiencies and potential risk. If dbt Labs themselves are facing these kinds of issues, what are practitioners meant to do…?
2. SaaS Sprawl = Data Sprawl = a Bad Time
With the proliferation of Software-as-a-Service (SaaS) tools in the modern data landscape, there is a tool for just about everything you need. This has its perks, but also results in tool cruft and data sprawl. Each tool generally houses its own data, leading to fragmented data management. Managing and integrating data from various sources becomes increasingly complex, and keeping track of how data flows across the entire stack and system can be really challenging. As the tech stack sprawls, it becomes increasingly difficult to know how your upstream changes will affect downstream dependencies.
3. Data management is still too often manual admin work
Despite great advancements in data management and governance, tools in this area far too often remain a manual step in the data team’s workflows. Maintaining these tools can create significant overhead for data teams, and rolling these initiatives out across a company can take a long time. When properly invested in, these efforts can be tremendously valuable for improving company alignment, resource usage, and the impact of data initiatives. Unfortunately, they are also often the first to be cut during difficult times, as their ROI can be the hardest to measure.
4. AI Can't Solve Everything
While advancements in generative AI and LLMs will undoubtedly play significant roles in the industry over the coming months and years, I don’t agree that they will be a panacea for all data challenges (I’m not alone in this), or that they will put specific data roles out of work. For these efforts to be reliable and accurate, data teams need to build a strong foundation of data quality and invest in generating high-quality metadata. Properly managed metadata enables AI tooling to deliver better data discovery, lineage tracking, and overall data governance. And as we saw in point 1 above, we’re far from figuring that part out. I do think data roles will change with the help of generative AI, but hopefully this will mean saving our brains for the complex human problems that generative AI is not so great at solving (an augmentation more than a replacement).
Addressing problems of scale in a post-modern data stack world
1. Automate Enforcement of Data Quality and Project Quality
To maintain data integrity at scale, data teams should automate data quality checks throughout their data pipelines from the start. This builds a strong foundation on which you can build and scale. For example, using pre-commit hooks to enforce coding conventions and dbt project quality can help validate developers’ changes and identify discrepancies early on. This reduces the burden on your team’s human resources to enforce best practices, and relies on automation for consistency and speed.
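As a rough illustration of the kind of check a pre-commit hook could run, here is a minimal sketch in Python. The naming convention (`stg_`, `int_`, `fct_`, `dim_` prefixes), the `models/` directory layout, and the function names are all assumptions made for this example, not something prescribed by dbt or pre-commit.

```python
from pathlib import PurePosixPath

# Hypothetical convention: every model file name starts with one of these prefixes.
ALLOWED_PREFIXES = ("stg_", "int_", "fct_", "dim_")


def find_naming_violations(paths):
    """Return the model files whose names do not follow the convention."""
    violations = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Only check SQL files that live somewhere under a models/ directory.
        if path.suffix != ".sql" or "models" not in path.parts:
            continue
        if not path.name.startswith(ALLOWED_PREFIXES):
            violations.append(raw)
    return violations


def main(argv):
    """Print each violation and return an exit code (0 = pass, 1 = fail)."""
    violations = find_naming_violations(argv)
    for violation in violations:
        print(f"naming convention violation: {violation}")
    return 1 if violations else 0
```

pre-commit passes the staged file names to a hook as command-line arguments and treats a non-zero exit code as a failure, so a hook script could wrap this as `sys.exit(main(sys.argv[1:]))`.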
2. Leverage Tools That Help Avoid Data Sprawl
Investing in data management platforms, data observability, or data catalogs is often seen as a “nice to have” for data teams, something you think about adding after you’ve reached the “holy grail” of impact and value. However, these tools can significantly improve your understanding of the size and scope of your data assets, and drive more impact from what you have already created. They allow users outside the data team to discover, understand, and access data efficiently, which increases the impact of your existing assets. In addition, they can help you understand your most and least used pipelines and assets, so you can invest in proper optimization and timely deprecation. This supports data team cost containment and reduction.
3. Invest in Metadata Quality Management
For generative AI tooling to be successful, data teams must prioritize building robust metadata to provide business context and knowledge about their domain area. High-quality metadata enhances data discovery, improves generative AI responses, and helps facilitate data governance. Implement metadata standards and automate metadata generation wherever possible.
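One simple way to start enforcing a metadata standard is to measure description coverage. The sketch below assumes model metadata has already been loaded into plain dictionaries with `name`, `description`, and `columns` keys; that shape is a simplification chosen for this example, and the function name is hypothetical.

```python
def documentation_coverage(models):
    """Return (coverage ratio, undocumented items) across models and columns."""
    total = 0
    missing = []
    for model in models:
        total += 1
        # Treat a missing, empty, or whitespace-only description as undocumented.
        if not (model.get("description") or "").strip():
            missing.append(model["name"])
        for column in model.get("columns", []):
            total += 1
            if not (column.get("description") or "").strip():
                missing.append(f'{model["name"]}.{column["name"]}')
    coverage = 1.0 if total == 0 else (total - len(missing)) / total
    return coverage, missing
```

A check like this could run in CI and fail when coverage drops below a threshold the team agrees on, which keeps the standard enforced without manual review.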
4. Monitor the Performance of Your Team and Tech Stack
Ensure that you have established KPIs and metrics for your team and your tech stack (e.g. what are your sprint velocity, transformation pipeline run-time, and warehouse costs, and how are these changing over time?). Building comprehensive monitoring and reporting about your own team and tech stack will help you gain insight into areas of strength and weakness in a more objective manner. These insights help you identify bottlenecks and plan larger initiatives to improve performance.
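To make "how are these changing over time" concrete, here is a small sketch that compares the most recent period of a metric against the period before it. The metric names, window size, and threshold are illustrative assumptions, and the logic only suits metrics where an increase is bad (run-time, cost), not ones like sprint velocity.

```python
from statistics import mean


def percent_change(baseline, current):
    """Percentage change from baseline to current."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100


def flag_regressions(history, window=4, threshold_pct=20.0):
    """Flag metrics whose recent average grew by more than threshold_pct.

    history maps a metric name to a list of values, oldest first.
    Metrics with fewer than 2 * window observations are skipped.
    """
    flagged = {}
    for metric, values in history.items():
        if len(values) < 2 * window:
            continue
        previous = mean(values[-2 * window:-window])
        recent = mean(values[-window:])
        change = percent_change(previous, recent)
        if change > threshold_pct:
            flagged[metric] = change
    return flagged
```

Comparing window averages rather than single data points is a deliberate choice here, since one slow run or one expensive day is usually noise rather than a trend.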
5. Enforce cost containment methods early and often
Regardless of what kind of budget you’re on, if you’re building a modern data stack in 2023, cost containment should be on your mind from day one. Implementing things like Slim CI and fresh production rebuilds is critical (this is pretty straightforward with dbt Cloud, but can also be built with dbt Core and some light orchestration). Leveraging incremental models can vastly improve dbt project runtime and reduce warehouse costs. Choosing the right ETL solution for your needs can significantly impact cost (e.g. compare Fivetran’s MAR pricing model with Portable’s cost-per-connector offering; there are also open source solutions). Regardless, measuring and monitoring your costs regularly is critical (the team at SELECT built a great dbt package for doing this on Snowflake), so you can be prepared for how they will grow over time.
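The idea behind Slim CI is to build only what changed plus everything downstream of it, rather than the whole project. The sketch below illustrates that selection as a plain graph traversal. The DAG representation and function name are assumptions for this example. It is a conceptual stand-in for what a selector like dbt’s `state:modified+` expresses, not how dbt implements it, and it does not attempt to detect which models were modified.

```python
from collections import deque


def select_modified_and_downstream(children, modified):
    """Return the modified models plus every model downstream of them.

    children maps a model name to the list of models that depend on it directly.
    """
    selected = set(modified)
    queue = deque(modified)
    # Breadth-first walk through the dependency graph.
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected
```

In a project with hundreds of models, a change to a single staging model typically selects only a small slice of the DAG, which is where the CI run-time and warehouse savings come from.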
We’ve settled into the “postmodern” era of the modern data stack, which requires us to take a hard look at some of the problems data teams are still facing.
While data teams may never achieve all three (Good, Fast, and Cheap), we can certainly keep innovating and chasing that goal. Hopefully the information above provides a good summary of some of the main challenges facing teams using the modern data stack today, and some ideas for how to overcome them.